Apache Ignite and Apache Spark integration, Cache loading into Spark Context using IgniteRDD

If I create an IgniteRDD out of a cache with 10M entries in my Spark job, will it load all 10M entries into my Spark context? Please find my code below for reference.
SparkConf conf = new SparkConf().setAppName("IgniteSparkIntgr").setMaster("local");
JavaSparkContext context = new JavaSparkContext(conf);
JavaIgniteContext<Integer, Subscriber> igniteCxt = new JavaIgniteContext<Integer,Subscriber>(context,"example-ignite.xml");
JavaIgniteRDD<Integer,Subscriber> cache = igniteCxt.fromCache("subscriberCache");
DataFrame query_res = cache.sql("select id, lastName, company from Subscriber where id between ? and ?", 12, 15);
DataFrame input = loadInput(context);
DataFrame joined_df = input.join(query_res,input.col("id").equalTo(query_res.col("ID")));
System.out.println(joined_df.count());
In the above code, subscriberCache has more than 10M entries. Will the 10M Subscriber objects be loaded into the JVM at any point in the above code, or does it only load the query output?
FYI: Ignite is running in a separate JVM.

The cache.sql(...) method queries data that is already in the Ignite in-memory cache, so you should load the data before running the query. You can use the IgniteRDD.saveValues(...) or IgniteRDD.savePairs(...) method for this. Each of them will iterate through all partitions and load all the data that currently exists in Spark into Ignite.
Note that any transformations or joins that you do with the resulting DataFrame will be done locally on the driver. You should avoid this as much as possible to get the best performance out of the Ignite SQL engine.
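For illustration, here is a minimal sketch of the load-then-query flow described above, reusing the classes from the question (JavaIgniteContext, JavaIgniteRDD, Subscriber); the subscriberPairs RDD is a placeholder for however the (id -> Subscriber) data is built in Spark:
void loadAndQuery(JavaSparkContext context, JavaPairRDD<Integer, Subscriber> subscriberPairs) {
    JavaIgniteContext<Integer, Subscriber> igniteCxt =
            new JavaIgniteContext<Integer, Subscriber>(context, "example-ignite.xml");
    JavaIgniteRDD<Integer, Subscriber> cache = igniteCxt.fromCache("subscriberCache");
    // Iterates through all Spark partitions and writes the pairs into the Ignite cache.
    cache.savePairs(subscriberPairs);
    // The SQL is executed against the data now held in Ignite; the result comes back as a DataFrame.
    DataFrame res = cache.sql(
            "select id, lastName, company from Subscriber where id between ? and ?", 12, 15);
    res.show();
}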

Related

Retrieving data from s3 bucket in pyspark

I am reading data from an S3 bucket in PySpark. I need to parallelize the read operation and do some transformations on the data, but it's throwing an error. Below is the code.
s3 = boto3.resource('s3',aws_access_key_id=access_key,aws_secret_access_key=secret_key)
bucket = s3.Bucket(bucket)
prefix = 'clickEvent-2017-10-09'
files = bucket.objects.filter(Prefix = prefix)
keys=[k.key for k in files]
pkeys = sc.parallelize(keys)
I have a global variable d, which is an empty list, and I am appending deviceID data to it.
Applying flatMap on the keys:
pkeys.flatMap(map_func)
This is the function:
def map_func(key):
    print "in map func"
    for line in key.get_contents_as_string().splitlines():
        # parse one line of json
        content = json.loads(line)
        d.append(content['deviceID'])
But the above code gives me an error. Can anyone help?
You have two issues that I can see. The first is that you are trying to manually read data from S3 using boto instead of using the direct S3 support built into Spark and Hadoop. It looks like you are trying to read text files containing one JSON record per line. If that is the case, you can just do this in Spark:
df = spark.read.json('s3://my-bucket/path/to/json/files/')
This will create a Spark DataFrame for you by reading in the JSON data, with each line becoming a row. DataFrames require a rigid, pre-defined schema (like a relational database table), which Spark will determine by sampling some of your JSON data. After you have the DataFrame, all you need to do to get your column is select it like this:
df.select('deviceID')
The other issue worth pointing out is that you are attempting to use a global variable to store data computed across your Spark cluster. It is possible to send data from your driver to all of the executors running on Spark workers using either broadcast variables or implicit closures, but there is no way in Spark to write to a variable in your driver from an executor. To transfer data from executors back to the driver, you need to use Spark's action methods, which are intended for exactly this purpose.
Actions are methods that tell Spark you want a result computed, so it needs to go and execute the transformations you have told it about. In your case you would probably want one of the following (see the sketch after this list):
If the results are large: use DataFrame.write to save the results of your transformations back to S3.
If the results are small: use DataFrame.collect() to download them back to your driver and do something with them.
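A minimal sketch of both options, shown here with Spark's Java API (the bucket paths are placeholders; the equivalent PySpark calls are the df.write and df.collect named above):
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeviceIdExtract {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("DeviceIdExtract").getOrCreate();

        // One JSON record per line; Spark infers the schema by sampling the data.
        Dataset<Row> deviceIds = spark.read()
                .json("s3a://my-bucket/clickEvent-2017-10-09/")
                .select("deviceID");

        // Large result: write it back to S3 instead of pulling it to the driver.
        deviceIds.write().json("s3a://my-bucket/output/device-ids/");

        // Small result: collect it to the driver as a List<Row>.
        List<Row> rows = deviceIds.collectAsList();
        System.out.println("collected " + rows.size() + " device ids");

        spark.stop();
    }
}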

Lazy Evaluation in Spark. How does Spark load data from DB

Suppose we have set a limit of 100, and the Spark application is connected to a DB with a million records. Does Spark load all million records, or does it load them 100 at a time?
How does Spark load data from a DB? It depends on the database type and its connector implementation. Of course, for a distributed processing framework, distributed data ingestion is always the primary aim when building connectors.
As a brief example, if we have 1 million records in a table and we define the number of partitions to be 100 when we load(), then ideally the read will be distributed across the executors so that each executor reads a range of 10,000 records and stores them in its corresponding partition in memory. See SQL Databases using JDBC.
In the Spark UI, you can see that numPartitions dictates the number of tasks that are launched. The tasks are spread across the executors, which can increase the parallelism of the reads and writes through the JDBC interface.
Spark provides flexible interfaces (Spark DataSource V2) that allow us to build our own custom data source connectors. The main design key here is parallelizing the read operation according to how many partitions are defined. Also check (figure 4) to understand how distributed CSV ingestion works in Spark.
Update
Read from JDBC connections across multiple workers
# Split the read of the employees table into 100 partitions on the emp_no column.
# lowerBound/upperBound only control how the ranges are split; they do not filter rows.
df = spark.read.jdbc(
    url=jdbcUrl,
    table="employees",
    column="emp_no",
    lowerBound=1,
    upperBound=100000,
    numPartitions=100
)
display(df)
In the sample code above, we used a JDBC read to split the table read across executors on the emp_no column using the partitionColumn (passed as column in the PySpark API), lowerBound, upperBound, and numPartitions parameters.

Storing data obtained from Cassandra in Spark memory and making it available to other Spark Job Server jobs in the same context

I am using Spark Job Server and Spark SQL to get data from a Cassandra table as follows:
public Object runJob(JavaSparkContext jsc, Config config) {
    CassandraSQLContext sq = new CassandraSQLContext(JavaSparkContext.toSparkContext(jsc));
    sq.setKeyspace("rptavlview");
    DataFrame vadevent = sq.sql("SELECT username,plan,plate,ign,speed,datetime,odo,gd,seat,door,ac from rptavlview.vhistory ");
    vadevent.registerTempTable("history");
    sq.cacheTable("history");
    DataFrame vadevent1 = sq.sql("SELECT plate,ign,speed,datetime FROM history where username='"+params[0]+"' and plan='"+params[1]+"'");
    long count = vadevent.rdd().count();
}
But I am getting a "table not found: history" error.
Can anybody explain how to cache Cassandra data in Spark memory and reuse it, either across concurrent requests of the same job or as two jobs, one for caching and the other for querying?
I am using DSE 5.0.4, so the Spark version is 1.6.1.
You can allow Spark jobs running in the same context to share state, for example through Spark Job Server's named RDDs/objects; the Spark Job Server documentation goes into more depth on this.

Does Parquet predicate pushdown work on S3 using Spark (non-EMR)?

Just wondering if Parquet predicate pushdown also works on S3, not only HDFS, specifically if we use Spark (non-EMR).
Further explanation might be helpful, since it might involve some understanding of distributed file systems.
I was wondering this myself, so I just tested it out. We use EMR clusters and Spark 1.6.1.
I generated some dummy data in Spark and saved it as a parquet file locally as well as on S3.
I created multiple Spark jobs with different kind of filters and column selections. I ran these tests once for the local file and once for the S3 file.
I then used the Spark History Server to see how much data each job had as input.
Results:
For the local parquet file: The results showed that the column selection and filters were pushed down to the read as the input size was reduced when the job contained filters or column selection.
For the S3 parquet file: The input size was always the same as that of a Spark job that processed all of the data. None of the filters or column selections were pushed down to the read; the parquet file was always completely loaded from S3, even though the query plan (.queryExecution.executedPlan) showed that the filters were pushed down.
I will add more details about the tests and results when I have time.
Yes. Filter pushdown does not depend on the underlying file system. It only depends on spark.sql.parquet.filterPushdown and the type of filter (not all filters can be pushed down).
See https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L313 for the pushdown logic.
Here are the keys I'd recommend for s3a work:
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
For committing the work, use the S3A "zero-rename committer" (Hadoop 3.1+) or the EMR equivalent. The original FileOutputCommitters are slow and unsafe.
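For example, these keys can be set on the SparkSession builder and the pushdown verified from the physical plan; a minimal Java sketch (the bucket path and the id column are placeholders):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetPushdownCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParquetPushdownCheck")
                .config("spark.sql.parquet.filterPushdown", "true")
                .config("spark.sql.parquet.mergeSchema", "false")
                .config("spark.hadoop.parquet.enable.summary-metadata", "false")
                .config("spark.sql.hive.metastorePartitionPruning", "true")
                .getOrCreate();

        // Read Parquet from S3 via s3a and apply a filter that is eligible for pushdown.
        Dataset<Row> df = spark.read()
                .parquet("s3a://my-bucket/path/to/parquet/")
                .filter("id > 100");

        // The physical plan should list something like PushedFilters: [IsNotNull(id), GreaterThan(id,100)].
        df.explain(true);

        spark.stop();
    }
}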
Recently I tried this with Spark 2.4, and it seems like predicate pushdown works with S3.
This is the Spark SQL query:
explain select * from default.my_table where month = '2009-04' and site = 'http://jdnews.com/sports/game_1997_jdnsports__article.html/play_rain.html' limit 100;
And here is part of the output:
PartitionFilters: [isnotnull(month#6), (month#6 = 2009-04)], PushedFilters: [IsNotNull(site), EqualTo(site,http://jdnews.com/sports/game_1997_jdnsports__article.html/play_ra...
This clearly shows that PushedFilters is not empty.
Note: the table used was created on top of AWS S3.
Spark uses the HDFS Parquet and S3 libraries, so the same logic works.
(And in Spark 1.6 they added an even faster shortcut for flat-schema Parquet files.)

Spark HiveContext does not retrieve newly inserted records from Hive Table

I am using Spark 1.4, and HiveContext is used to connect to Hive. I did the following:
val hx = new HiveContext(sc)
import hx.implicits._
hx.sql("select * from tab").show
// it is fine, result was shown as expected
Then I inserted a few records into tab from the Beeline console and ran:
hx.refreshTable("tab")
hx.sql("select * from tab").show
// still old records, no newly inserted records
My question is: why didn't the HiveContext retrieve the newly inserted records?
hiveContext.refreshTable(tableName: String) - this will refresh only the metadata of the table (not the actual data).
Notes from the official documentation (credits: https://spark.apache.org):
refreshTable(tableName: String): Unit
Invalidate and refresh all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache.
To retrieve newly inserted records: uncache the table first and then cache it again, using uncacheTable(String tableName) and cacheTable(String tableName), as sketched below.
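A minimal sketch of that sequence, shown with the Java API (the question's code is Scala, but the HiveContext methods are the same); tab is the table from the question:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class RefreshTableExample {
    public static void main(String[] args) {
        JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("RefreshTableExample").setMaster("local"));
        HiveContext hx = new HiveContext(sc.sc());

        hx.cacheTable("tab");
        hx.sql("select * from tab").show();   // initial read

        // ... records are inserted into `tab` outside Spark (e.g. from Beeline) ...

        hx.uncacheTable("tab");               // drop the stale cached data
        hx.refreshTable("tab");               // invalidate the cached metadata
        hx.cacheTable("tab");                 // re-cache, forcing a re-read of the underlying files
        hx.sql("select * from tab").show();   // the newly inserted records should now appear
    }
}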
If the target table is partitioned, you need to insert with the 'partition' option. If you leave out the partition, the data will not be visible.
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...) SELECT col1,col2,.... FROM tablename2
In a slightly different case, I had an RDD coming from a Spark SQL statement via HiveContext. The solution that worked for me, after some experiments, was to actually regenerate the RDD itself.
It does not matter whether you are using the DDL provided by Spark SQL or sending SQL statements directly via hiveContext.sql.
I have seen people using a "count trick" to force recomputation of a dataset, but at least in my attempts I couldn't get to see the new data this way.
Anyway, caching, refreshing, and friends did not work for me; if somebody has a proper pattern here, please share.