Querying Cache from Compute Job - ignite

Is it possible to query an Ignite cache from within a compute job in Apache Ignite? Can someone point me to any existing examples?

Never mind, an answer is posted here: http://apache-ignite-users.70518.x6.nabble.com/newbie-question-how-best-to-pass-Ignite-to-a-ComputeJob-td503.html. An instance of Ignite can be injected into the job with the @IgniteInstanceResource annotation and then used to access the cache.

Related

Schedule task to load BigQuery table into Apache Ignite

I have a use case where we need to periodically load a BigQuery table into a cache and support SQL queries from there. I'm researching Apache Ignite and think it could be a good fit for our use case. It's just not clear to me yet how I can get auto-load from BigQuery. By "auto-load" I mean keeping Apache Ignite updated with the BigQuery table data and making this updating transparent to applications. In most cases, our BigQuery tables are updated by other scheduled jobs/queries at intervals from 5 minutes to 1 month.
I'm new to Ignite, and I guess my questions are the following:
Is this a feature supported in Ignite already? (I couldn't find any)
Or are there any existing plugins already? (I couldn't find any)
How can I implement the auto-load cache for BigQuery using Ignite?
You can do this once with a CacheStore and loadCache(), but reloading the whole table every few minutes is infeasible. You may wish to design a BigQuery streamer for Apache Ignite, if BigQuery supports pushing deltas.
If Google BigQuery doesn't open its changelog files to CDC tools, then find a different way to capture those updates and stream them to Ignite via its IgniteDataStreamer API. There should be a way to capture the changes with some pub/sub mechanism.
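If no change feed is available, one fallback is to poll only the rows modified since the last run. The sketch below is plain Java and does not call BigQuery or Ignite. It assumes (hypothetically) that each row carries a last-modified timestamp, here named updatedAt, and it models the table as an in-memory list. In a real job the list would be replaced by a query filtered on that column, and the returned rows would be handed to IgniteDataStreamer.addData(). Note that this approach does not detect deleted rows.

```java
import java.util.ArrayList;
import java.util.List;

public class WatermarkPoller {
    /** Hypothetical row shape: key, value, and last-modified time in epoch millis. */
    public static final class Row {
        public final String key;
        public final String value;
        public final long updatedAt;

        public Row(String key, String value, long updatedAt) {
            this.key = key;
            this.value = value;
            this.updatedAt = updatedAt;
        }
    }

    // Highest updatedAt seen so far; nothing has been seen before the first poll.
    private long watermark = Long.MIN_VALUE;

    /** Returns the rows modified after the current watermark, then advances the watermark. */
    public List<Row> poll(List<Row> source) {
        List<Row> changed = new ArrayList<>();
        long newest = watermark;
        for (Row row : source) {
            if (row.updatedAt > watermark) {
                changed.add(row);
                newest = Math.max(newest, row.updatedAt);
            }
        }
        watermark = newest;
        return changed;
    }

    public static void main(String[] args) {
        WatermarkPoller poller = new WatermarkPoller();
        List<Row> table = new ArrayList<>();
        table.add(new Row("a", "1", 100L));
        table.add(new Row("b", "2", 200L));
        System.out.println("first poll:  " + poller.poll(table).size() + " changed rows");
        table.set(0, new Row("a", "9", 300L));
        System.out.println("second poll: " + poller.poll(table).size() + " changed rows");
    }
}
```

Each scheduled run then moves only the changed rows, so the cost depends on the amount of change rather than on the table size.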

Apache Ignite: Control data

I want to ask a question:
How does Apache Ignite distribute data?
How can I control the distribution in Apache Ignite?
For example, I want to distribute more data to some nodes (because they have more memory and are able to store more data), and less data to other nodes.
Thank you!!
If you want to do this for one cache, you can implement your own affinity function (https://apacheignite.readme.io/docs/affinity-collocation#section-affinity-function), but this is not recommended because it will not be scalable. If you just want to specify which nodes host a new cache, you can try nodeFilter in CacheConfiguration.
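To illustrate the kind of logic a custom affinity function could contain, here is a minimal plain-Java sketch of weighted rendezvous hashing. It does not use or implement Ignite's AffinityFunction interface. The node names, the weights, and the helper names (mix64, unitHash, ownerOf, countOwners) are assumptions made for the example. Each partition goes to the node with the highest score, and a node with three times the weight ends up with roughly three times as many partitions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedAssignment {
    /** SplitMix64 finalizer, used here only to scramble the (node, partition) pair. */
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Deterministic pseudo-random value strictly between 0 and 1 for a (partition, node) pair. */
    static double unitHash(int partition, String nodeId) {
        long h = mix64(((long) nodeId.hashCode() << 32) | (partition & 0xFFFFFFFFL));
        return ((h >>> 12) + 1) / (double) ((1L << 52) + 1);
    }

    /** Picks the node with the highest weighted rendezvous score: -weight / ln(u). */
    public static String ownerOf(int partition, Map<String, Double> weights) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            double score = -e.getValue() / Math.log(unitHash(partition, e.getKey()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Counts how many of the partitions 0..partitions-1 each node owns. */
    public static Map<String, Integer> countOwners(int partitions, Map<String, Double> weights) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String node : weights.keySet()) {
            counts.put(node, 0);
        }
        for (int p = 0; p < partitions; p++) {
            counts.merge(ownerOf(p, weights), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("big-node", 3.0);   // hypothetical node with more memory
        weights.put("small-node", 1.0); // hypothetical node with less memory
        System.out.println(countOwners(1024, weights));
    }
}
```

Because a partition's owner depends only on the per-node scores, adding or removing a node moves only the partitions that node wins or loses, which is the property that makes rendezvous hashing attractive for this kind of assignment.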

Ignite query on local node: potential issue?

I'm new to Ignite and have a use case where I need to run a cleanup job. Ignite is embedded in our Spring Boot application, which runs as multiple instances. I am thinking of having the job run on each instance, query only the local data, and clean that up. Do you see any issue with this? I am not sure how often Ignite reshuffles data.
Thanks
Shannon
You can surely do that.
With regard to data reshuffling, it only happens when a node is added to or removed from the cluster. However, the ignite.compute().affinityRun() family of calls guarantees that the code runs on the node that holds the data.
Otherwise, you could do ignite.compute().broadcast() and only iterate on each affected cache's local entries. You don't have the aforementioned guarantee then, though.

Spark SQL with Ignite

I am trying to speed up Spark SQL queries by introducing Ignite as a cache layer, using IgniteRDD. The example in the Ignite docs loads data from an Ignite cache to construct the RDD. But in our use case the data may be too big to fit into Ignite memory; we actually keep the data in HBase. So is it possible to:
1. Construct an IgniteRDD by loading data from HBase
2. Just use Ignite to cache and share an RDD generated by Spark SQL, to speed up Spark SQL
There are three possible approaches.
First approach. If you run Ignite SQL queries from Spark using the igniteRdd.sql(...) method, then all the data must be stored in the Ignite cluster. The Ignite SQL engine cannot query an underlying third-party persistence layer when only part of the data is cached in memory. But if you enable Ignite persistence and store all your data there instead of in HBase, then you can cache as much data as fits in memory and still run SQL safely, since Ignite can query its own persistence.
Second approach is to use HBase as a cache store (you need to implement your own version since there's nothing out of the box) and use Spark SQL queries instead of Ignite SQL, because the latter requires all the data to be cached in RAM if Ignite persistence is not used.
Third approach is to try out the Ignite in-memory file system (IGFS) and the Hadoop accelerator. IGFS and the accelerator are deployed on top of HDFS. However, here you cannot use the IgniteRDD API because all the operations will go through this pipeline: Spark->HBase->IGFS+Accelerator+HDFS.
If I were to choose I would go for the first approach.
Apart from the three approaches above, if you have the flexibility to add another component, use Apache Phoenix. It supports integration with Spark SQL; you can check this on their official website. In this case you will not need Apache Ignite.

Can we copy Apache Ignite Cluster to another Ignite cluster?

I want to back up an entire Ignite cluster so that the backup cluster can be used if the original (active) cluster is down. Is there any approach for this?
If you need two separate clusters with replication across data centers, it would be better to look at GridGain, which supports Datacenter Replication.
Unfortunately, Ignite does not support DR.
With Apache Ignite you can logically divide your cluster into two zones to guarantee that every zone contains a full copy of the data. However, there is no way to choose the primary node for partitions manually. See AffinityFunction and the affinityBackupFilter() method of the standard implementations.
As answered above, a ready-made solution is only available in the paid version. Open source Apache Ignite provides the ability to take a cluster-wide full snapshot. You can add a cron job in your Ignite cluster to take this snapshot and add another job to copy the snapshot data to object storage like S3.
On the other side, download this data node by node into the work directories of the respective nodes, following the manual restore procedure, and start the cluster. It should activate automatically once all baseline nodes have started successfully, and the cluster is then ready to use.