Apache Ignite: Сache API vs SQL - ignite

What should I use cache.put(key, value) or cache.query("INSERT INTO Table ")?

In case you properly configured queryable fields for your cache you can use both ways to insert data into the cache:
Key-Value API as shown here.
SqlFieldsQuery as described here.
Also, in case you would like to upload a large amount of data you can use Data Streamer, which automatically buffer the data and group it into batches for better performance.

Any. Or both.
One of the powers of Ignite is that it's truly multi-model - the same data is accessible via different interfaces. If you migrate a legacy app from an RDBMS, you'll use SQL. If you have something simple and don't care about the schema or queries, you'll use key-value.
In my experience, non-trivial systems based on Apache Ignite tend to use different kinds of access simultaneously. A perfectly normal example of an app:
Use key-value to insert the data from an upstream source
Use SQL to read and write data in batch processing and analytics
Use Compute with both SQL and key-value inside of the tasks to do colocated processing and fast analytics

Related

How can I load data from BigQuery to Spanner?

I'd like to run a daily job that does some aggregations based on a BigQuery setup. The output is a single table that I write back to BigQuery that is ~80GB over ~900M rows. I'd like to make this dataset available to an online querying usage pattern rather than for analysis.
Querying the data would always be done on specific slices that should be easy to segment by primary or secondary keys. I think Spanner is possibly a good option here in terms of query performance and sharding, but I'm having trouble working out how to load that volume of data into it on a regular basis, and how to handle "switchover" between uploads because it doesn't support table renaming.
Is there a way to perform this sort of bulk loading programatically? We already are using Apache Airflow internally for similar data processing and transfer tasks, so if it's possible to handle it in there that would be even better.
You can use Cloud Dataflow.
In your pipeline, you could read from BigQuery and write to Cloud Spanner.

What are the options to bulk/batch load data into Apache Geode(Gemfire)?

We need to load millions of key/values into Apache Geode and we'd like to know what are some the options available. Our values happen to be in the 256kb range.
There are several options depending on your application requirements/SLAs or whether you need to perform conversion or other transformations, etc.
Out-of-the-box, Apache Geode provides the Cache & Region Snapshot Service. This is useful when you want to migrate data from 1 existing Apache Geode cluster to another, for instance. Not so useful if your data is coming from an external source, like a RDBMS.
Another option is to lazily load the data based on need. This can be accomplished by implementing the CacheLoader interface and registering the CacheLoader with a Region. Obviously, you could create a CacheLoader implementation that intelligently loads a block of data based on some rules/criteria in addition to loading and returning the single value of interests based on the current requests.
A lot of times, users create an external, custom Conversion process or tool to extract, transform and bulk load (ETL) a bunch of data into Apache Geode. This is typical in complex Use Cases or requirements. However, it is highly advisable to use perhaps a framework/tool like...
Spring XD (now Spring Cloud Data Flow on Pivotal's Cloud Foundry (PCF)) is great ETL tool and pipeline for creating stream-based applications. Spring XD / SCDF provides many different options for "sources" and "sinks" (e.g. GemFire Server). In addition to sources & sinks, you can even "tap" the stream to process the data with "Processors". So whether you are doing real-time stream or batch-oriented data operations (e.g. bulk loads), Spring XD is a great option.
I am sure Google might provide other answers on how to perform ETL with a KeyValue store like Apache Geode.
Hope this helps get you going.
Cheers,
John
We have very limited options to load Gemfire regions .
1) Spring batch:
Create Gemfire writer for load data and remove data
Create batch configuration and lod it
2) Apache Spark
https://www.linkedin.com/pulse/fast-data-access-using-gemfire-apache-spark-part-vaquar-khan-/

Realtime queries in deepstream "cache" layer?

I see, that by using RethinkDB connector one can achieve real time querying capabilites by subscribing into specifically named lists. I assume, that this is not actually the fastest solution, as the query probably updates only after changes to records are written to the database. Is there any recommended approach to achieve realtime querying capabilites deepstream-side?
There are some favourable properties like:
Number of unique queries is small compared to number of records or even number of connected clients
All manipulation of records that are subject to querying is done via RPC.
I can imagine multiple ways how to do that:
Imitate the rethinkdb connector approach. But for that I am missing a list.listen() method. With that I would be able to create a backend process creating a list on-demand and on each RPC CRUD operation on records update all currently active lists=queries.
Reimplement basic list functionality in records and use the above approach with now existing .listen()
Use .listen() in events?
Or do we have list.listen() and I just missed it? Or there is more elegant way how to do it?
Great question - generally lists are a client-side concept, implemented on top of records. Listen notifies you about clients subscribing to records, not necessarily changing them - change notifications arrive via mylist.subscribe(data => {}) or myRecord.subscribe(data => {}).
The tricky bit is the very limited querying capability of caches. Redis has a basic concept of secondary indices that can be searched for ranges and intersection, memcached and co are to my knowledge pure key-value stores, searchable only by ID - as a result the actual querying would make most sense on the database layer where your data will usually arrive in significantly less than 200ms.
The RethinkDB search provider offers support for RethinkDB's built in realtime querying capabilites. Alternatively you could use MongoDB and trail its operations log or use PostGres and deepstream's built in subscribe feature for change notifications.

Sql query over Ignite CacheStore or over database

I am a beginner for Ignite, so I have some puzzles, one of which is as follows:when I try to query cache, whether it can look if memory contains or not. If not, then whether it will query database? If not,how to achieve such way?
Please help me if you know.Thx.
Queries work over in-memory data only. You can either use key access (operations like get(), getAll(), etc.) and utilize automatic read-through from the persistence store, or manually preload the data before running queries. For information on how effectively load large data set into the cache, see this page: https://apacheignite.readme.io/docs/data-loading

What is a recommended scalable DB platform to use in AWS for large amounts of volatile data sets - elasticsearch, Redis or DynamoDB?

Users of our platform will have large amounts of stored data on our system. Through an application, once connected, that data will be transferred to them and no longer need to remain on our servers. There could potentially be hundreds or thousands of users connected at any given time, performing their downloads.
Here's the proposed architecture:
User management, configuration, and data download statistics will be maintained in a SQL Server database, while using either Redis or DynamoDB for the large data sets.
The reason for choosing either Redis or DynamoDB is based on cost - cheaper than running another SQL Server instance, and performance. The data format will be similar to a datamart - flat table with no joins.
Initially the queries would be simple - get all data for user X between a date range, and optionally delete.
Since we may want to add free text searching for certain fields of that data using elasticsearch may be a better option to use from the get-go.
I want this to be auto-scaling but not sure which database would be best to use for this scenario.
Here's some great discussion on Database + Search tier from AWS ReInvent:
https://youtu.be/K7o5OlRLtvU?t=1574
I would not take Elastic-search alone because it does not provide auto-scaling for writing capacity. In fact, it's not trivial to augment the number of shard of an index. Secondly it can only handle the JSON format, which could be an issue for you.
Redis could be a good idea because it is really fast, everything is done in RAM, and it provides keys with a limited time-to-live which could be interesting for you. Unfortunately, if your data size exceeds the capacity in RAM of your amazon instance you will have to shard your Redis database. And Redis does not support it, you will have to deal it on your application code. Moreover, as far as I know Redis does not handle complex queries. You will also need to save your data in a Redis data structure which could be an issue for you
DynamoDB handles auto-scaling really well but on the other hand it is a key/value database so it does not allow you to make queries like "get all data for user X between a date range". DynamoDB also allows you to save your data in any format.
The solution will be to use either DynamoDB or either Redis depending of the size of your datas, and to use ElasticSearch in order to index your key with only the meta-data (user and dates). Like that your index will be small, and if you lost the ability to index because of ElasticSearch get too buzy, you keep the ability to save user's datas.