Apache Ignite with Kudu - ignite

I am trying to position Ignite as a query grid for databases such as Kudu, HBase, etc., so that all data silos are queried through Ignite with read/write-through. How is this possible? Are there any integrations with them?
The first time a SQL query runs, it will need to pull the data from such databases and create the key/value entries in Ignite.
Then, if one, two, or three nodes go down, the data stored in memory will eventually be lost. How is recovery done, or is it not possible?
Thanks
CK

Ignite SQL cannot load data from an external store on demand; read-through happens only on key-based API operations such as get()/getAll(). To query the data you first need to load it into Ignite, for example with loadCache(). Internally, this function queries the target database and transforms the response into key-value entries.
BTW, if you enable persistence in Ignite, it will know the structure of the data and will be able to query it even if not all entries are loaded into memory.
Node crashes are traditionally handled by replicating data between nodes; in Ignite these replicas are called backups. If you lose more nodes than the number of backups configured, you'll need to reload the data from the store.
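To make the backup rule concrete, here is a toy Python model. It is not the Ignite API: the placement scheme, the node count, and the function names `owners` and `lost_keys` are assumptions made purely for illustration. Each key gets one primary copy plus `backups` extra copies on other nodes, and a key survives as long as at least one node holding a copy is still up.

```python
def owners(key, node_count, backups):
    """Return the nodes holding a copy of `key`: one primary plus `backups` replicas.

    Hypothetical placement: consecutive nodes starting at key % node_count.
    """
    primary = key % node_count
    return [(primary + i) % node_count for i in range(backups + 1)]


def lost_keys(keys, node_count, backups, failed_nodes):
    """Keys whose every copy lived on a failed node; these must be reloaded from the store."""
    failed = set(failed_nodes)
    return [k for k in keys if all(n in failed for n in owners(k, node_count, backups))]


keys = list(range(8))

# One backup, one node down: every key still has a surviving copy.
print(lost_keys(keys, node_count=4, backups=1, failed_nodes=[0]))     # []

# One backup, two adjacent nodes down: keys stored only on nodes 0 and 1 are gone.
print(lost_keys(keys, node_count=4, backups=1, failed_nodes=[0, 1]))  # [0, 4]
```

In this sketch, losing more nodes than the backup count does not always lose data (it depends on which nodes fail), but it is the point at which loss becomes possible.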

Related

Can I query a Redis database directly (persistent, not in-memory), or is data always kept in memory and requests executed against the in-memory data?

A simple question about using Redis as a persistent database (not in-memory):
Can I directly query the Redis database from my Spring Boot application (just like with MySQL or Oracle DB), or must data always be loaded into memory first and requests executed against the in-memory data?
Thanks.
When you query data from Redis, it does not load that data into memory at that point. Redis is an in-memory database, meaning it always keeps all the data in its memory, and when you send a query to Redis, it processes it against the data that is already in memory.
Redis is an in-memory database which you can treat like any other external dependency you may have in your application. Compared with the other databases you mentioned, it does not offer the ability to use SQL to query it, so you must rely on its own commands, which are very specific.
There are some Java clients you can use to interact with Redis, including Lettuce and Jedis. The commands you send to Redis are executed against the data that Redis itself keeps in its own memory.

Cons of using MemoryCache as a temporary copy of DB table

I have a site where you can list your car for sale. There is a list and a map with filtering on car types and other car specifications. My idea was to cache the cars table and filter on that when a user is searching for a car on the website. Currently, especially when zooming in/out on the map, each time the user does that an HTTP request is made and it queries the database, which can be slow and heavy on the server.
As an experiment with 1 000 items, I have cached the map data (trimmed data with only basic info) and it's working fine. I was thinking of keeping what is basically a copy of the cars table, with all needed joins applied, in Memory Cache and using that instead of querying the DB on every request for both the list and the map. I would have a cron job run every 5 minutes (as data can change, but the update doesn't have to be immediate) to refresh Memory Cache with the latest cars data from the DB.
What would be the cons of using this approach in the long term, for example when storing 100 000 records? Besides the server needing more RAM, would there be any concerns about the scalability or usability of this approach? Would it be better to use Redis instead?
I do have a "search as you type" service in place now, but I don't really need that functionality since the filtering is pretty exact. I added it more as a caching server, but I think I would be better off just using Memory Cache until there is a real need for that kind of service.
Thank you
Since memory isn't infinite, we need to limit the number of items stored in the in-memory cache.
MemoryCache VS Redis
MemoryCache
MemoryCache is embedded in the process, hence it can only be used as a plain key-value store from that process.
Redis
Redis is a remote data structure server. It is certainly slower than just storing the data in local memory.
In short, MemoryCache runs inside the web server process of the current application and is limited by that server's resources. On the same hardware it will be very fast. The main disadvantage is that the stored data cannot be shared with other applications.
With Redis, reads are not as fast as reading directly from MemoryCache, but Redis offers high reliability and high scalability.
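As a rough illustration of limiting the number of items in an in-process cache, here is a minimal Python sketch of a size-bounded cache with least-recently-used eviction. It only stands in for the idea: the class name `BoundedCache` and its methods are hypothetical and are not the MemoryCache API, and LRU is just one possible eviction policy.

```python
from collections import OrderedDict


class BoundedCache:
    """In-process key-value cache that evicts the least recently used entry when full."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)  # drop the least recently used entry

    def __len__(self):
        return len(self._items)


cache = BoundedCache(max_items=2)
cache.put("car:1", {"make": "Audi"})
cache.put("car:2", {"make": "BMW"})
cache.get("car:1")                     # touch car:1 so car:2 becomes the eviction candidate
cache.put("car:3", {"make": "Volvo"})
print(cache.get("car:2"))              # None, because car:2 was evicted
print(len(cache))                      # 2
```

Because the dictionary lives inside one process, a second web server would hold its own separate copy, which is the sharing limitation mentioned above.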
Related Post:
1. How to update redis after updating database?
2. how to keep caching up to date
3. How can MySQL update data in real time in redis cache?

jboss data grid for clustered enterprise application - what is the efficient way

We have a clustered enterprise application that uses JTA transactions and Hibernate for database operations, deployed on JBoss EAP.
To increase system performance we are planning to use JBoss Data Grid. This is how I plan to use it:
I add/replace the object in the cache whenever it is inserted/updated in the database, using cache.put.
When an object is deleted from the database, it is deleted from the cache using cache.remove.
When retrieving, first try to get the data from the cache using a key or query. If the data is not present, load it from the database.
However, I have below questions on data grid:
To query objects we are using Hibernate Criteria; however, Data Grid uses its own query builder. Can we avoid writing separate queries for Hibernate and Data Grid?
I want a list of objects matching a criterion to be returned. If one of the objects matching the criterion has been evicted from the cache, is it reloaded automatically from the database?
If a transaction is rolled back, is it rolled back in the Data Grid cache as well?
Are there any examples I can refer to for my implementation of Data Grid?
Which is the better choice for my requirement: Infinispan as a 2nd level cache, or Data Grid in library or remote mode?
Galder's comment is right: the best practice is using Infinispan as the second-level cache provider. Trying to implement it on your own is very prone to timing issues (you'd have stale/non-updated entries in the cache).
Regarding queries: with 2LC query caching on, the cache keeps a map of 'sql query' -> 'list of results'. However, once you update any type that's used in a query, all such queries are invalidated (e.g. if the query lists people with age > 60, updating a newborn still invalidates that query). Therefore this should be turned on only when queries prevail over updates.
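The invalidation rule can be pictured with a toy Python model. This is not Hibernate's or Infinispan's implementation; the `QueryCache` class, its methods, and the sample data are hypothetical and only mimic the behaviour described: any update to an entity type drops every cached query on that type, whether or not the changed row would have matched.

```python
class QueryCache:
    """Toy query cache: maps a query string to its result list, grouped by entity type."""

    def __init__(self):
        self._results = {}  # entity type -> {query text -> cached result list}

    def get(self, entity_type, query, run_query):
        cached = self._results.setdefault(entity_type, {})
        if query not in cached:
            cached[query] = run_query()  # cache miss: execute against the database
        return cached[query]

    def on_update(self, entity_type):
        # Any write to the type drops every cached query on it, relevant or not.
        self._results.pop(entity_type, None)


people = [{"name": "Ann", "age": 71}, {"name": "Bob", "age": 35}]
db_hits = []


def over_60():
    db_hits.append("over_60")  # record each real database execution
    return [p["name"] for p in people if p["age"] > 60]


cache = QueryCache()
cache.get("Person", "age > 60", over_60)   # miss, runs the query
cache.get("Person", "age > 60", over_60)   # hit, served from the cache

people.append({"name": "Baby", "age": 0})  # does not match the query...
cache.on_update("Person")                  # ...but still invalidates it
cache.get("Person", "age > 60", over_60)   # miss again

print(len(db_hits))  # 2 database executions for 3 lookups
```

With frequent updates, almost every lookup becomes a miss, which is why the query cache pays off only for read-heavy workloads.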
Infinispan has its own query support but this is not exposed when using it as 2LC provider. It is assumed that the cache will hold only a (most frequently accessed) subset of the entities in the database and therefore the results of such queries would not be correct.
If you want to go for Infinispan but keep the DB persistence, an option might be using the JPA cache store (and indexing). Note though that updates to the DB that don't go through Infinispan would not be reflected in the cache, and the indexing may lag a bit (since it's asynchronous). You can also split your dataset and use JPA for one part and Infinispan + JPA cache store for the other.
A third option is using Hibernate Search, which keeps the data in the database while the index is in Lucene (possibly stored in Infinispan caches, too); you then use the Hibernate Search API instead of the Criteria API.

Spark SQL with Ignite

I am trying to speed up Spark SQL queries by introducing Ignite as a cache layer, using IgniteRDD. In the example from the Ignite docs, it loads data from an Ignite cache to construct the RDD. But in our use case the data may be too big to fit into Ignite memory; we actually just keep the data in HBase. So is it possible to:
1. Construct an IgniteRDD by loading data from HBase?
2. Just use Ignite to cache and share the RDD generated by Spark SQL, to speed up Spark SQL?
There are three possible usage scenarios.
First approach. If you run Ignite SQL queries from Spark using igniteRdd.sql(...) method then all the data must be stored in an Ignite cluster. Ignite SQL engine cannot query an underlying 3rd party persistence layer if not all the data is cached in memory. But if you enable Ignite persistence and store all your data there instead of HBase then you can cache as much data as possible and run SQL safely since Ignite can query its own persistence.
The second approach is to use HBase as a cache store (you need to implement your own version since there's nothing out-of-the-box) and use Spark SQL queries instead of Ignite SQL, because the latter requires all the data to be cached in RAM if Ignite persistence is not used.
The third approach is to try out the Ignite in-memory file system (IGFS) and the Hadoop accelerator. IGFS and the accelerator are deployed on top of HDFS. However, here you cannot use the IgniteRDD API because all the operations will go through this pipeline: Spark->HBase->IGFS+Accelerator+HDFS.
If I were to choose I would go for the first approach.
Apart from the three approaches above, if you have the flexibility to add another component, use Apache Phoenix. It supports integration with Spark SQL; you can check this on its official website. In this case you will not need Apache Ignite.

Sql query over Ignite CacheStore or over database

I am a beginner with Ignite, so I have some questions, one of which is as follows: when I query a cache, does it check whether the data is in memory? If it is not, will it then query the database? If not, how can I achieve this?
Please help me if you know. Thx.
Queries work over in-memory data only. You can either use key access (operations like get(), getAll(), etc.) and rely on automatic read-through from the persistence store, or manually preload the data before running queries. For information on how to load large data sets into the cache efficiently, see this page: https://apacheignite.readme.io/docs/data-loading
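The difference between key access and queries can be sketched with a toy Python model. It is not the Ignite API: `ReadThroughCache`, `load_cache`, and the dictionary standing in for the database are assumptions for illustration. Key lookups fall through to the store on a miss, while queries scan only what is already in memory.

```python
class ReadThroughCache:
    """Toy cache: key lookups fall through to the store, queries see only loaded entries."""

    def __init__(self, store):
        self._store = store  # stands in for the external database
        self._memory = {}

    def get(self, key):
        if key not in self._memory and key in self._store:
            self._memory[key] = self._store[key]  # read-through on a miss
        return self._memory.get(key)

    def load_cache(self):
        self._memory.update(self._store)  # bulk preload of everything

    def query(self, predicate):
        # Scans in-memory entries only; never consults the store.
        return sorted(k for k, v in self._memory.items() if predicate(v))


store = {1: {"city": "Oslo"}, 2: {"city": "Oslo"}, 3: {"city": "Rome"}}
cache = ReadThroughCache(store)

cache.get(1)                                       # pulls key 1 from the store
print(cache.query(lambda v: v["city"] == "Oslo"))  # [1], key 2 was never loaded

cache.load_cache()
print(cache.query(lambda v: v["city"] == "Oslo"))  # [1, 2]
```

The first query silently returns an incomplete result, which is why the data has to be preloaded before queries are run.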