I am trying to speed up Spark SQL queries by introducing Ignite as a cache layer, using IgniteRDD. In the examples from the Ignite docs, data is loaded from an Ignite cache to construct the RDD. But in our use case the data size may be too big to fit into Ignite memory; we actually keep the data in HBase. So is it possible to:
1. Construct an IgniteRDD by loading data from HBase?
2. Just use Ignite to cache the shared RDDs generated by Spark SQL, in order to speed up Spark SQL?
There are a few possible usage scenarios.
First approach. If you run Ignite SQL queries from Spark using the igniteRdd.sql(...) method, then all the data must be stored in the Ignite cluster. The Ignite SQL engine cannot query an underlying 3rd-party persistence layer unless all the data is cached in memory. But if you enable Ignite native persistence and store all your data there instead of HBase, then you can cache as much data as fits in memory and still run SQL safely, since Ignite can query its own persistence.
Second approach is to use HBase as a cache store (you need to implement your own store, since there is nothing out of the box) and use Spark SQL queries instead of Ignite SQL, because the latter requires all the data to be cached in RAM if Ignite persistence is not used.
Third approach is to try out the Ignite in-memory file system (IGFS) and the Hadoop accelerator. IGFS and the accelerator are deployed on top of HDFS. However, here you cannot use the IgniteRDD API, because all the operations go through the pipeline Spark -> HBase -> IGFS + Accelerator + HDFS.
If I were to choose, I would go for the first approach.
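To make the first approach more concrete, here is a rough sketch in Scala. Everything in it is an assumption for illustration: the cache name, the key/value types, the client configuration and the use of the predefined _key/_val columns all depend on how your cluster and your queryable types are actually configured.

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.configuration.{CacheConfiguration, DataStorageConfiguration, IgniteConfiguration}
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.sql.SparkSession

// Server side: an Ignite node with native persistence enabled, so SQL can be
// run safely even when only part of the data set fits in RAM.
object IgniteServerNode {
  def main(args: Array[String]): Unit = {
    val storageCfg = new DataStorageConfiguration()
    storageCfg.getDefaultDataRegionConfiguration.setPersistenceEnabled(true)

    // A queryable cache: with setIndexedTypes the SQL table is named after the
    // value type, "String" in this toy example.
    val cacheCfg = new CacheConfiguration[java.lang.Long, String]("personCache")
      .setIndexedTypes(classOf[java.lang.Long], classOf[String])

    val ignite = Ignition.start(new IgniteConfiguration()
      .setDataStorageConfiguration(storageCfg)
      .setCacheConfiguration(cacheCfg))

    ignite.cluster().active(true) // persistence-enabled clusters start inactive
  }
}

// Spark side: run Ignite SQL over the cluster data through an IgniteRDD.
object SparkIgniteSqlJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ignite-sql").getOrCreate()

    // Default discovery for a local demo; point this at your cluster in reality.
    val ic = new IgniteContext(spark.sparkContext, () => new IgniteConfiguration())
    val rdd = ic.fromCache[java.lang.Long, String]("personCache")

    rdd.sql("SELECT _key, _val FROM String WHERE _key < ?", 100L).show()
  }
}
```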
Apart from the above three approaches, if you have the flexibility to add another component, take a look at Apache Phoenix. It supports integration with Spark SQL; you can check it on their official website. In that case you will not need Apache Ignite.
I have a set of parquet files on S3 (the parquet files are refreshed every half hour). We have a working Spark job that internally builds an SQL query and runs it against the parquet files.
We have a parallel system where the data is also replicated to Elasticsearch.
The requirement is to expose a REST API to check whether the ES queries produce the same result set as the Spark SQL queries. ES is easy, since it already exposes a REST endpoint. The problem is with Spark: every time, I have to spin up a new job, write the output to a JSON file, and have another micro-service read the Spark output JSON and stream it over REST.
Instead I was thinking:
run a Jetty server in the Spark driver and run Spark in client mode, so Spark runs 24x7;
Jetty exposes the REST API from the Spark driver. The dataframes etc. are all in the Spark context (meaning I save the Spark startup time). Essentially it takes a request and runs the SQL against the dataframes available in the Spark context;
the Jetty server should support multiple concurrent requests;
we will refresh the dataframes every half hour.
Is there a better solution than this? What are the pros and cons of this approach? Please let me know; any code samples would be of great help.
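For what it's worth, here is a minimal sketch of that idea in Scala, assuming Jetty 9 is on the driver classpath and Spark runs in client mode. The S3 path, the view name "events" and the "q" request parameter are placeholders, and in a real service you would of course validate the incoming SQL and bound the result size.

```scala
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

import org.apache.spark.sql.SparkSession
import org.eclipse.jetty.server.handler.AbstractHandler
import org.eclipse.jetty.server.{Request, Server}

object SparkSqlRestServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-over-rest").getOrCreate()

    // (Re)load the parquet data and register it as a temp view. Call this from
    // a scheduled thread every half hour to pick up the refreshed files.
    def refresh(): Unit =
      spark.read.parquet("s3a://my-bucket/data/").createOrReplaceTempView("events")

    refresh()

    // Jetty handles requests on its own thread pool, so concurrent REST calls
    // simply become concurrent Spark jobs inside this long-running driver.
    val server = new Server(8080)
    server.setHandler(new AbstractHandler {
      override def handle(target: String, baseRequest: Request,
                          request: HttpServletRequest, response: HttpServletResponse): Unit = {
        val sql  = request.getParameter("q") // e.g. ?q=SELECT count(*) FROM events
        val json = spark.sql(sql).toJSON.collect().mkString("[", ",", "]")
        response.setContentType("application/json")
        response.getWriter.write(json)
        baseRequest.setHandled(true)
      }
    })

    server.start()
    server.join() // keeps the driver, and hence the SparkContext, alive 24x7
  }
}
```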
What should I use: cache.put(key, value) or cache.query("INSERT INTO Table ...")?
If you have properly configured queryable fields for your cache, you can use both ways to insert data into it:
Key-Value API as shown here.
SqlFieldsQuery as described here.
Also, if you would like to load a large amount of data, you can use the Data Streamer, which automatically buffers the data and groups it into batches for better performance.
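As a rough illustration of both ways plus the streamer, here is a Scala sketch. It assumes a cache whose key/value types are registered for SQL via setIndexedTypes, so a table named after the value type ("String" here) with the predefined _key/_val columns is visible; adapt the types and the SQL to your actual model.

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery
import org.apache.ignite.configuration.CacheConfiguration

object InsertExamples {
  def main(args: Array[String]): Unit = {
    val ignite = Ignition.start()

    // Queryable cache: SQL sees a table named after the value type ("String").
    val cfg = new CacheConfiguration[Integer, String]("demoCache")
      .setIndexedTypes(classOf[Integer], classOf[String])
    val cache = ignite.getOrCreateCache(cfg)

    // 1. Key-Value API.
    cache.put(1, "value-1")

    // 2. SQL INSERT through SqlFieldsQuery.
    cache.query(
      new SqlFieldsQuery("INSERT INTO String (_key, _val) VALUES (?, ?)")
        .setArgs(Integer.valueOf(2), "value-2")
    ).getAll()

    // 3. Data streamer for bulk loading: entries are buffered on the client
    //    and shipped to the cluster in batches.
    val streamer = ignite.dataStreamer[Integer, String]("demoCache")
    try {
      (3 to 100000).foreach(i => streamer.addData(i, s"value-$i"))
    } finally {
      streamer.close() // flushes any remaining buffered entries
    }
  }
}
```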
Any. Or both.
One of the powers of Ignite is that it's truly multi-model - the same data is accessible via different interfaces. If you migrate a legacy app from an RDBMS, you'll use SQL. If you have something simple and don't care about the schema or queries, you'll use key-value.
In my experience, non-trivial systems based on Apache Ignite tend to use different kinds of access simultaneously. A perfectly normal example of an app:
Use key-value to insert the data from an upstream source
Use SQL to read and write data in batch processing and analytics
Use Compute with both SQL and key-value inside of the tasks to do colocated processing and fast analytics
I am trying to position Ignite as a query grid for databases such as Kudu, HBase, etc. Thus, all data silos would be queried through Ignite with read/write-through. How is this possible? Are there any integrations with them?
The first time an SQL query runs, it will need to pull the data from those databases and create the key/value entries in Ignite.
Then, if one/two/three nodes go down, the data stored in memory will eventually be lost. How is recovery done, or is it not possible?
Thanks
CK
Ignite SQL cannot load specific data from an external store by query; that only works for the key-value get()/getAll() operations (read-through). To be able to query data with SQL, you need to load it into Ignite first, for example with loadCache(). Internally this method runs a query against the target database and transforms the response into key-value pairs.
BTW, if you enable Ignite native persistence, Ignite will know the structure of the data and will be able to query it even if not all entries are loaded into memory.
In the case of a node crash, data replication between nodes is the traditional remedy; in Ignite these replicas are called backups. If you lose more nodes than the configured number of backups, you will need to preload the data from the store again.
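Putting the above into a rough Scala sketch: the cache store pointing at HBase/Kudu is something you would have to implement yourself (only the place where it is plugged in is shown, commented out, with a hypothetical HBaseCacheStore class), while backups and loadCache() are the standard Ignite pieces.

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.configuration.CacheConfiguration

object QueryGridSketch {
  def main(args: Array[String]): Unit = {
    val ignite = Ignition.start()

    val cfg = new CacheConfiguration[java.lang.Long, String]("externalDataCache")
      // Plug in your own store implementation here, e.g.:
      // .setCacheStoreFactory(FactoryBuilder.factoryOf(classOf[HBaseCacheStore]))
      // .setReadThrough(true)  // get()/getAll() fall through to the store on a miss
      // .setWriteThrough(true) // put()/remove() are propagated to the store
      .setBackups(1)            // one backup copy per partition: losing a single node is safe

    val cache = ignite.getOrCreateCache(cfg)

    // SQL only sees entries that are in Ignite (unless native persistence is
    // enabled), so preload the data set first. loadCache() invokes
    // CacheStore.loadCache(), which queries the external database and streams
    // the key-value pairs into the cache.
    cache.loadCache(null)
  }
}
```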
I really love Apache Ignite's shared RDD for Spark. However, due to a limitation on our side, I cannot deploy Ignite onto the cluster nodes. The only way I can use Ignite is through embedded mode with Spark.
I would like to know: in embedded mode, can the RDD be shared between different Spark applications?
I have two Spark jobs:
Job 1: produces the data and stores it into the shared RDD.
Job 2: retrieves the data from the shared RDD and does some calculation.
Can this be done using Ignite's embedded mode?
Thanks
In embedded mode, Ignite nodes are started inside the executors, which are under Spark's control. Having said that, this mode is, in my opinion, more for testing purposes: you don't need to deploy and start Ignite separately, while still being able to try the basic functionality. But in real scenarios it would be very hard to achieve consistency and failover guarantees, because Spark can start and stop executors, and in embedded mode those executors are the ones actually holding the data. I would recommend working around your limitation and making sure Ignite can be installed separately, in standalone mode.
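To illustrate the recommended standalone setup, here is a rough sketch of two separate Spark applications sharing state through the same Ignite cache. The Spring config path ignite-config.xml and the cache name are placeholders, and an Ignite cluster is assumed to be running and reachable independently of Spark.

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.sql.SparkSession

// Application 1: produces the data and saves it into the shared, cache-backed RDD.
object ProducerApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("producer").getOrCreate()
    // Standalone mode (the default): executors join the externally deployed
    // Ignite cluster as clients, so the data lives in Ignite, not in the executors.
    val ic = new IgniteContext(spark.sparkContext, "ignite-config.xml")
    val sharedRdd = ic.fromCache[Int, Int]("sharedNumbers")
    sharedRdd.savePairs(spark.sparkContext.parallelize(1 to 100000).map(i => (i, i * i)))
  }
}

// Application 2: a completely separate Spark job that reads the same cache later.
object ConsumerApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("consumer").getOrCreate()
    val ic = new IgniteContext(spark.sparkContext, "ignite-config.xml")
    val sharedRdd = ic.fromCache[Int, Int]("sharedNumbers")
    println(s"entries visible to the second job: ${sharedRdd.count()}")
  }
}
```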
I want to know how to load data from Amazon S3 into an Apache Ignite cluster. Would a single-node or a multi-node cluster be required?
You can load data into any cluster, single node or multi node, as long as your data set fits in the memory of that cluster. Please refer to this documentation page for information about data loading: https://apacheignite.readme.io/docs/data-loading
You can use Spark + Ignite as a workaround: Spark reads from S3 and then writes to Ignite, as explained in the Ignite examples.
Also, you can use Spark Structured Streaming with a trigger-once run just to write the incremental files to Ignite, combining Spark Structured Streaming and Ignite:
https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
https://github.com/apache/ignite/blob/85af9c789a109f7f067145972a82693c7d28b4a9/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteDataFrameWriteExample.scala#L74
https://github.com/apache/ignite/blob/85af9c789a109f7f067145972a82693c7d28b4a9/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteDataFrameWriteExample.scala#L89
I am not sure whether Spark overwrites Ignite tables; a workaround would be to create a dataframe on the existing Ignite data, union it with the latest data, and then overwrite the Ignite table.
https://github.com/apache/ignite/blob/85af9c789a109f7f067145972a82693c7d28b4a9/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteDataFrameWriteExample.scala#L113
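For reference, here is a condensed Scala sketch of the batch variant (read parquet from S3, write to Ignite through the Spark DataFrame API), modeled on the write example linked above. The bucket path, the config file, the table name and the assumption that the data contains an id column are all placeholders.

```scala
import org.apache.ignite.spark.IgniteDataFrameSettings._
import org.apache.spark.sql.{SaveMode, SparkSession}

object S3ToIgnite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-to-ignite").getOrCreate()

    // Placeholder S3 location; the parquet files are assumed to have an "id" column.
    val df = spark.read.parquet("s3a://my-bucket/events/")

    df.write
      .format(FORMAT_IGNITE)
      .option(OPTION_CONFIG_FILE, "ignite-config.xml")      // client config pointing at the cluster
      .option(OPTION_TABLE, "events")                        // Ignite SQL table to create/fill
      .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "id")  // primary key column(s)
      .mode(SaveMode.Append)
      .save()
  }
}
```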