How to export data to csv from GemFire - gemfire

I am trying to build an ETL process that extracts data out of GemFire and loads it into Teradata. However, I am not finding a good mechanism to export the data. The only thing I have found so far is a REST API that gets all entries from a region. Is this suitable for bulk export? It returns the data as JSON, which has to be parsed before loading into a table, and I assume that won't be very performant for a large volume of data. Is there any other solution to this, like exporting data as CSV from GemFire, or an ODBC/JDBC connection to GemFire? I found both bulk export and ODBC/JDBC in the GemFire XD documentation but not in core GemFire. Are they not supported in core GemFire? What is the difference between core GemFire and the XD version?

These are two different products designed with different things in mind. GemFire XD provides a low-latency SQL interface to in-memory table data, so it's generally used as an in-memory RDBMS. GemFire, on the other hand, is an in-memory data grid with "no restrictions" regarding the data you insert into regions: you basically deal with custom Java objects, not with tables. Also, GemFire XD was originally built on top of GemFire; I'm not sure what the current status of that is (you might want to have a look at Snappy Data for more details).
That said, and strictly speaking of GemFire, you can export snapshots of your regions and import them afterwards into another cluster. Even better, you can read a snapshot entry by entry for further processing or transformation into other formats, which I believe is exactly what you're looking for. Please have a look at Cache and Region Snapshots to get the details.
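For example, a rough sketch of reading a snapshot file entry by entry and flattening it into CSV could look like the following (package names are from Apache Geode / recent GemFire releases, older GemFire versions use com.gemstone.gemfire instead, and the region, file and column names are just placeholders):

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;

import org.apache.geode.cache.snapshot.SnapshotIterator;
import org.apache.geode.cache.snapshot.SnapshotReader;

public class SnapshotToCsv {

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Snapshot previously exported, e.g. with gfsh:
        //   gfsh> export data --region=/Orders --file=orders.gfd --member=server1
        File snapshotFile = new File("orders.gfd");

        SnapshotIterator<Object, Object> entries = SnapshotReader.read(snapshotFile);
        try (PrintWriter csv = new PrintWriter(new File("orders.csv"))) {
            csv.println("key,value");
            while (entries.hasNext()) {
                Map.Entry<Object, Object> entry = entries.next();
                // In real code you would cast the value to your domain class and
                // write its individual fields out as separate CSV columns.
                csv.printf("%s,%s%n", entry.getKey(), entry.getValue());
            }
        } finally {
            entries.close();
        }
    }
}
```

From there, loading the resulting CSV into Teradata can be done with your usual bulk-load tooling.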
Hope this helps. Cheers.

Related

Accessing Spark Streaming Data pipelines. What option works best?

I am looking for the best option to access data from Spark data pipelines. The scenario is as follows:
I am reading data from Kafka topics and creating a streaming dataframe, which is then cleaned and printed to the console. I need this data to be integrated with existing Python scripts that do all their data operations with Pandas. I have considered the following options:
Write streaming data to local memory (e.g. Hive Tables).
Use Spark Structured Streaming ForeachBatch Sink.
I want to mention that the data is to be read at a certain interval, and in the future there will be a real-time dashboard built on this data.
Please advise which would be the best approach to handle this scenario. Apologies if the question sounds too basic. Thanks in advance.
If you save the data to Hive each time before accessing the newly streamed data through your Python scripts, the newly added Hive partitions have to be refreshed each time as well (MSCK REPAIR TABLE):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
Here are some disadvantages of using Hive for the real-time scenario you mention:
https://www.quora.com/What-are-some-disadvantages-of-Apache-Hive#
Spark Structured Streaming, on the other hand, looks like the better choice for a near-real-time experience:
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
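As a rough illustration of the foreachBatch approach (shown here with the Java API, though the same method exists in PySpark; the broker, topic, checkpoint and output paths below are placeholders), each micro-batch is handed over as a normal DataFrame that you can clean and write somewhere your Pandas scripts can read, e.g. Parquet:

```java
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class KafkaToParquetBatches {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-foreach-batch")
                .getOrCreate();

        // Read the Kafka topic as a streaming DataFrame.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load()
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        // Every trigger interval, write the cleaned micro-batch as Parquet files
        // that the existing Pandas scripts (or a future dashboard) can read.
        events.writeStream()
                .trigger(Trigger.ProcessingTime("5 minutes"))
                .option("checkpointLocation", "/tmp/checkpoints/events")
                .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batch, batchId) ->
                        batch.write().mode("append").parquet("/data/clean_events"))
                .start()
                .awaitTermination();
    }
}
```

Your Python/Pandas code could then read the Parquet output at whatever interval suits it, and a future real-time dashboard could be pointed at the same location.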

Hortonworks: HBase, Hive, etc. used for which type of data

I would like to ask if anyone could tell me, or refer me to a web page, which describes all the possibilities for storing data in an Apache Hadoop cluster.
What I would like to know is: which type of data should be stored in which "system"? By type of data I mean, for example:
Live data (realtime)
Historical data
Data which is regularly accessed from an application
...
The question is not restricted to HBase or Hive ("systems") but covers everything that is available under HDP.
I hope someone can point me in a direction where I can find my answer. Thanks!
I can give you an overview, but the rest of it you will have to read up on your own.
Let's begin with the types of data you want to store in HDFS:
Data in motion (which you referred to as real-time data).
So, how can you fetch real-time data? Is it even possible? The answer is no, there will always be a delay; however, we can reduce the delay and the processing time of the data. For that there is HDF (Hortonworks Data Flow), which works with data in motion. There are many services providing real-time data streaming and processing, for example Kafka, NiFi, Storm and many more. You also need to store the data in such a way that you can fetch it almost instantly (~2 seconds); for that we use HBase, which stores data in a columnar structure.
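As a small illustration of that HBase access pattern, a plain Java client put and point lookup against a pre-existing table might look like this (the table name, column family and row-key scheme below are made up for the example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickLookup {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {

            // Write one cell: row key -> column family "d", qualifier "payload".
            Put put = new Put(Bytes.toBytes("sensor-42#2024-01-01T00:00:00"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("{\"temp\":21.5}"));
            table.put(put);

            // Point lookup by row key -- this is the low-latency access pattern described above.
            Result result = table.get(new Get(Bytes.toBytes("sensor-42#2024-01-01T00:00:00")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"))));
        }
    }
}
```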
Data at rest (Historic/Data stored for future use)
Storing data at rest poses no such issues. HDP (Hortonworks Data Platform) provides the services to ingest, store and process the data, and we can even integrate HDF services into HDP (prior to version 2.6), which makes it easier to process data in motion as well. Here we need databases to store large amounts of data; HDFS (Hadoop Distributed File System) can store pretty much any kind of data. But we don't only want to store our data, we also want to fetch it quickly when it is required. How do we do that? By storing our data in a structured form, for which we have Hive and HBase. To process data at this scale (terabytes), we need to run heavy jobs, and that is where MapReduce, YARN, Spark and Kubernetes come into the picture.
This is the basic idea of storing and processing data in Hadoop.
The rest you can always read up on the internet.

How to handle data from an external, independent data source with Pivotal GemFire?

I am new to GemFire.
Currently we are using a MySQL DB and would like to move to GemFire.
How do we move the existing data stored in MySQL over to GemFire? I.e., is there any way to import existing MySQL data into GemFire?
There are many different options available for migrating data from one data store (e.g. an RDBMS like MySQL) to an IMDG (e.g. Pivotal GemFire). Pivotal GemFire does not provide any tools for this purpose out-of-the-box (OOTB).
However, you could...
A) Write a Spring Batch application to migrate all your data from MySQL to Pivotal GemFire in one fell swoop. This is typical of most large-scale conversion processes, converting from one data store to another, either as part of an upgrade or a migration.
The advantage of using Pivotal GemFire as your target data store is that it stores Java objects. So, if you are, say, using an ORM tool (e.g. Hibernate) to map the data stored in your MySQL database tables back to your application domain objects, you can then immediately and simply turn around and store those same objects directly into a corresponding Region in Pivotal GemFire. There is no additional mapping required to store an object into GemFire.
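As a rough sketch of option A, a minimal Spring Batch job (Spring Batch 4.x style) could read rows over plain JDBC and bulk-put them into a Region; the Customer class, the customers table, and the mysqlDataSource / customersRegion beans below are all hypothetical, and if you are using Hibernate you could swap the JDBC reader for a JPA-based one:

```java
import java.util.HashMap;
import java.util.Map;

import javax.sql.DataSource;

import org.apache.geode.cache.Region;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

@Configuration
@EnableBatchProcessing
public class MySqlToGemFireJobConfig {

    // Hypothetical domain class mapped from the customers table.
    public static class Customer {
        private Long id;
        private String name;
        public Long getId() { return id; }
        public void setId(Long id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    @Bean
    public JdbcCursorItemReader<Customer> customerReader(DataSource mysqlDataSource) {
        return new JdbcCursorItemReaderBuilder<Customer>()
                .name("customerReader")
                .dataSource(mysqlDataSource)
                .sql("SELECT id, name FROM customers")
                .rowMapper(new BeanPropertyRowMapper<>(Customer.class))
                .build();
    }

    @Bean
    public ItemWriter<Customer> customerWriter(Region<Long, Customer> customersRegion) {
        // customersRegion is assumed to be defined elsewhere (e.g. via Spring Data GemFire).
        // Bulk-put each chunk into the Region, keyed by the MySQL primary key.
        return items -> {
            Map<Long, Customer> batch = new HashMap<>();
            for (Customer customer : items) {
                batch.put(customer.getId(), customer);
            }
            customersRegion.putAll(batch);
        };
    }

    @Bean
    public Job migrateCustomersJob(JobBuilderFactory jobs, StepBuilderFactory steps,
                                   JdbcCursorItemReader<Customer> reader,
                                   ItemWriter<Customer> writer) {
        Step step = steps.get("migrateCustomersStep")
                .<Customer, Customer>chunk(500)
                .reader(reader)
                .writer(writer)
                .build();
        return jobs.get("migrateCustomersJob").start(step).build();
    }
}
```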
Although, if you need something less immediate, then you can also...
B) Take advantage of Pivotal GemFire's CacheLoader, and maybe even the CacheWriter mechanisms. The CacheLoader and CacheWriter are implementations of the "Read-Through" and "Write-Through" design patterns.
More details of this approach can be found here.
In a nutshell, you implement a CacheLoader to load data from some external data source on a cache miss. You attach, or register, the CacheLoader with a GemFire Region when the Region is created. When a key (which can correspond to your MySQL table's primary key) is requested (Region.get(key)) and an entry does not exist, GemFire will consult the CacheLoader to resolve the value, provided you actually registered a CacheLoader with the Region.
In this way, you slowly build up Pivotal GemFire from the MySQL RDBMS based on need.
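A bare-bones CacheLoader for that scenario could look something like the following (reusing the hypothetical Customer class and customers table from the Spring Batch sketch above, with recent GemFire/Geode package names):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import javax.sql.DataSource;

import org.apache.geode.cache.CacheLoader;
import org.apache.geode.cache.CacheLoaderException;
import org.apache.geode.cache.LoaderHelper;

// Loads a Customer from MySQL whenever Region.get(key) misses the cache.
public class MySqlCustomerCacheLoader implements CacheLoader<Long, Customer> {

    private final DataSource mysqlDataSource;

    public MySqlCustomerCacheLoader(DataSource mysqlDataSource) {
        this.mysqlDataSource = mysqlDataSource;
    }

    @Override
    public Customer load(LoaderHelper<Long, Customer> helper) throws CacheLoaderException {
        Long key = helper.getKey(); // the Region key doubles as the MySQL primary key
        String sql = "SELECT id, name FROM customers WHERE id = ?";
        try (Connection connection = mysqlDataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setLong(1, key);
            try (ResultSet rs = statement.executeQuery()) {
                if (!rs.next()) {
                    return null; // no row in MySQL -> Region.get(key) returns null
                }
                Customer customer = new Customer();
                customer.setId(rs.getLong("id"));
                customer.setName(rs.getString("name"));
                return customer;
            }
        } catch (Exception e) {
            throw new CacheLoaderException("Failed to load customer " + key, e);
        }
    }

    @Override
    public void close() {
        // nothing to clean up; the DataSource is managed elsewhere
    }
}
```

You would then register it when creating the Region, e.g. cache.createRegionFactory(RegionShortcut.PARTITION).setCacheLoader(new MySqlCustomerCacheLoader(mysqlDataSource)).create("Customers").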
Clearly, it is quite likely Pivotal GemFire will not be able to store all the data from your RDBMS in memory. So, you can enable both Persistence and Overflow [to Disk] capabilities. By enabling Persistence, GemFire will load the data from its own DiskStores the next time the nodes come online, assuming you brought them down beforehand.
The CacheWriter mechanism is nice if you want to run both Pivotal GemFire and MySQL in parallel for a while, until you can shift enough of the responsibilities from MySQL over to GemFire, for instance. The CacheWriter will write back to your underlying MySQL DB each time an entry is written or updated in the GemFire Region. You can even do this asynchronously (i.e. "Write-Behind") using GemFire's AsyncEventQueues and Listeners; see here.
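For the write-through side, a minimal CacheWriter sketch (again using the hypothetical Customer class and customers table) might look like this; it is registered the same way as the CacheLoader, via RegionFactory.setCacheWriter(..):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

import javax.sql.DataSource;

import org.apache.geode.cache.CacheWriterException;
import org.apache.geode.cache.EntryEvent;
import org.apache.geode.cache.util.CacheWriterAdapter;

// Synchronously writes every GemFire create/update through to MySQL ("Write-Through").
public class MySqlCustomerCacheWriter extends CacheWriterAdapter<Long, Customer> {

    private final DataSource mysqlDataSource;

    public MySqlCustomerCacheWriter(DataSource mysqlDataSource) {
        this.mysqlDataSource = mysqlDataSource;
    }

    @Override
    public void beforeCreate(EntryEvent<Long, Customer> event) throws CacheWriterException {
        upsert(event.getKey(), event.getNewValue());
    }

    @Override
    public void beforeUpdate(EntryEvent<Long, Customer> event) throws CacheWriterException {
        upsert(event.getKey(), event.getNewValue());
    }

    private void upsert(Long key, Customer customer) {
        String sql = "INSERT INTO customers (id, name) VALUES (?, ?) "
                   + "ON DUPLICATE KEY UPDATE name = VALUES(name)";
        try (Connection connection = mysqlDataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setLong(1, key);
            statement.setString(2, customer.getName());
            statement.executeUpdate();
        } catch (Exception e) {
            // Throwing here aborts the Region operation, keeping GemFire and MySQL consistent.
            throw new CacheWriterException("Failed to write customer " + key + " to MySQL", e);
        }
    }
}
```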
Obviously, you have many options at your disposal. You need to carefully weigh them and choose the approach that best meets your application's requirements and needs.
If you have additional questions, let me know.

What are the options to bulk/batch load data into Apache Geode(Gemfire)?

We need to load millions of key/value pairs into Apache Geode and we'd like to know what some of the available options are. Our values happen to be in the 256 KB range.
There are several options depending on your application requirements/SLAs or whether you need to perform conversion or other transformations, etc.
Out-of-the-box, Apache Geode provides the Cache & Region Snapshot Service. This is useful when you want to migrate data from one existing Apache Geode cluster to another, for instance. It is not so useful if your data is coming from an external source, like an RDBMS.
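For reference, a minimal sketch of driving the Snapshot Service programmatically (the region and file names are placeholders; gfsh's export data and import data commands do the same thing from the command line):

```java
import java.io.File;

import org.apache.geode.cache.Region;
import org.apache.geode.cache.snapshot.SnapshotOptions.SnapshotFormat;

public class RegionSnapshotMigration {

    // Export a Region from the source cluster; the resulting .gfd file can then
    // be imported into the target cluster with the matching load(...) call.
    public static void export(Region<?, ?> region) throws Exception {
        region.getSnapshotService().save(new File("orders.gfd"), SnapshotFormat.GEMFIRE);
    }

    public static void importInto(Region<?, ?> region) throws Exception {
        region.getSnapshotService().load(new File("orders.gfd"), SnapshotFormat.GEMFIRE);
    }
}
```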
Another option is to lazily load the data based on need. This can be accomplished by implementing the CacheLoader interface and registering the CacheLoader with a Region. Obviously, you could create a CacheLoader implementation that intelligently loads a block of data based on some rules/criteria, in addition to loading and returning the single value of interest based on the current request.
A lot of times, users create an external, custom conversion process or tool to extract, transform and bulk load (ETL) data into Apache Geode. This is typical for complex use cases or requirements. However, it is highly advisable to use a framework/tool like...
Spring XD (now Spring Cloud Data Flow on Pivotal Cloud Foundry (PCF)) is a great ETL tool and pipeline for creating stream-based applications. Spring XD / SCDF provides many different options for "sources" and "sinks" (e.g. a GemFire Server sink). In addition to sources and sinks, you can even "tap" the stream to process the data with "processors". So whether you are doing real-time streaming or batch-oriented data operations (e.g. bulk loads), Spring XD is a great option.
I am sure Google might provide other answers on how to perform ETL with a KeyValue store like Apache Geode.
Hope this helps get you going.
Cheers,
John
We have very limited options to load GemFire regions:
1) Spring Batch:
Create a GemFire writer to load and remove data
Create a batch configuration and load it
2) Apache Spark
https://www.linkedin.com/pulse/fast-data-access-using-gemfire-apache-spark-part-vaquar-khan-/

How to migrate Redis database to Aerospike?

We have a large Redis database. The number of keys exploded recently; we now have ~160M keys, which take up 50 GB+ of RAM.
What would be the best migration strategy to move all this data from Redis to Aerospike? We are planning to use Jedis later so hopefully after the migration it will be as simple as pointing our services to a new port.
Ideally we can somehow import the dump.rdb file into Aerospike.
You need to put in a little bit of extra work. Aerospike now supports Redis-like list and map APIs, so the migration will not be painful. However, you need to migrate both your data and your application.
To migrate the data, you can export the Redis data in CSV format using the redis-cli utility and load it into Aerospike using the Aerospike CSV loader utility. You can parallelize the loading if you split the data into multiple CSV files.
To migrate the application, it's best to use the Aerospike native client library for better integration. You can pick the language of your choice, and you should find an equivalent API for most of your needs. If you have already abstracted the basic calls in your application, the migration should be even smoother, as there will be only a few places where you need to change the calls.
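If the CSV route is awkward for your data (e.g. values containing delimiters), you can also do the copy programmatically; a rough sketch using Jedis (3.x) and the Aerospike Java client follows, with placeholder host names, namespace and set, and assuming plain string values (lists/hashes would need their own handling):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;

public class RedisToAerospikeMigrator {

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("redis-host", 6379);
             AerospikeClient aerospike = new AerospikeClient("aerospike-host", 3000)) {

            ScanParams params = new ScanParams().count(1000);
            String cursor = ScanParams.SCAN_POINTER_START;
            do {
                // SCAN walks the keyspace incrementally without blocking Redis.
                ScanResult<String> page = jedis.scan(cursor, params);
                for (String redisKey : page.getResult()) {
                    String value = jedis.get(redisKey);
                    aerospike.put(null, // default write policy
                            new Key("prod", "migrated", redisKey),
                            new Bin("value", value));
                }
                cursor = page.getCursor();
            } while (!ScanParams.SCAN_POINTER_START.equals(cursor));
        }
    }
}
```

As with the CSV approach, you can parallelize this, for example by running one copier per key prefix using SCAN's MATCH option.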