IgniteRDD saveValues too slow - ignite

We are trying to create an Ignite RDD from a Spark RDD as below:
def storeDataframeInCache(sc: SparkContext, rdd: RDD[Row]): Unit = {
  val igniteContext: IgniteContext[String, Row] =
    new IgniteContext[String, Row](sc, () => IgniteConfig.getIgniteConf(true), false)
  val igniteRDD = igniteContext.fromCache("rdd")
  igniteRDD.saveValues(rdd)
}
Here saveValues takes too much time.
Is there a better way of doing this?
Thanks in Advance!!!

There may be different reasons for poor performance. You should figure out whether the bottleneck is in Spark or in Ignite and optimize the slower part.
Cache performance strongly depends on its configuration. The more copies of the same data are stored on the cluster, the slower writes will be. If you want the cache to work fast, choose partitioned mode over replicated and disable backups. Persistence may also negatively affect the cache's performance. Refer to the documentation for more information: https://apacheignite.readme.io/docs/performance-tips
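As an illustration, here is a minimal sketch of a write-optimized cache configuration along those lines. It assumes the cache name "rdd" from the question; the rest is illustrative and not the asker's actual IgniteConfig.
import org.apache.ignite.cache.CacheMode
import org.apache.ignite.configuration.{CacheConfiguration, IgniteConfiguration}
import org.apache.spark.sql.Row

def writeOptimizedIgniteConf(): IgniteConfiguration = {
  val cacheCfg = new CacheConfiguration[String, Row]("rdd")
  cacheCfg.setCacheMode(CacheMode.PARTITIONED) // partitioned rather than replicated
  cacheCfg.setBackups(0)                       // no backup copies, so fewer writes per entry
  new IgniteConfiguration().setCacheConfiguration(cacheCfg)
}
Returning something like this from the closure passed to IgniteContext means saveValues only has to write a single primary copy of each entry.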

Related

Skip "Serialization/Deserialization" in SpringBatch Partitioning to scale data copy to million records

I am using the Spring Batch partitioning strategy to set up an ETL process (copying data from one DB to another), potentially a few million records.
When I increase the input payload size to more than a million records, the process fails with an out-of-memory error (testing with 14 GB RAM). After analyzing, I discovered that Spring Batch is creating a large number of byte[] and String objects, which hold a significant portion of the memory. These objects were created by the MapStepExecutionDao class while trying to save the StepExecution for each partition. Below is the call flow:
org.springframework.util.SerializationUtils.serialize(Object)
org.springframework.batch.core.repository.dao.MapStepExecutionDao.copy(StepExecution)
org.springframework.batch.core.repository.dao.MapStepExecutionDao.saveStepExecution(StepExecution)
org.springframework.batch.core.repository.dao.MapStepExecutionDao.saveStepExecutions(Collection)
org.springframework.batch.core.repository.support.SimpleJobRepository.addAll(Collection)
In this case, the ExecutionContext within StepExecution has one large string (in the format “abc, def, ………………………xyz”); I have to use this as an input to query data due to the table structure. MapStepExecutionDao.copy serializes and then deserializes the StepExecution object to create a copy of it. Is there a way to SKIP the “serialization/deserialization” of StepExecution to get its copy, since it creates additional byte[] and String objects that take significant heap, and heap consumption increases (more than I expect) with input size?
private static StepExecution copy(StepExecution original) {
return (StepExecution)SerializationUtils.deserialize(SerializationUtils.serialize(original));
}
Please let me know:
Is there a way to skip the “serialization/deserialization” of StepExecution?
Is there any configuration in Spring Batch that can avoid the “serialization/deserialization”?
Is there a way to extend/override this behavior?
This is a known issue of the Map-based DAOs. That's one of the reasons why we decided to deprecate them for removal, see https://github.com/spring-projects/spring-batch/issues/3780.
The recommended replacement is to use the JDBC-based job repository with an in-memory database. The issue I shared above contains a benchmark that you can use to test the performance improvement.
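As an illustration, a minimal sketch of that setup with an embedded H2 database might look like the following; the class name is hypothetical, while the schema script path is the one shipped with spring-batch-core.
import javax.sql.DataSource
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing
import org.springframework.context.annotation.{Bean, Configuration}
import org.springframework.jdbc.datasource.embedded.{EmbeddedDatabaseBuilder, EmbeddedDatabaseType}

@Configuration
@EnableBatchProcessing
class JdbcJobRepositoryConfig {
  // With a DataSource bean present, @EnableBatchProcessing backs the JobRepository
  // with the JDBC DAOs instead of the deprecated Map-based ones.
  @Bean
  def dataSource(): DataSource =
    new EmbeddedDatabaseBuilder()
      .setType(EmbeddedDatabaseType.H2)
      .addScript("classpath:org/springframework/batch/core/schema-h2.sql")
      .generateUniqueName(true)
      .build()
}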

Solution for Spark Plan Growing too Large and Job Erroring out

My Spark/Scala job processes data in a loop, and I need to use the result of the previous iteration to process the next one. I was caching the results, but the issue is that the Spark plan for the cached DataFrame grows too large and my job errors out. While I have found a backup solution of writing to and reading from S3, it is time consuming. So, if there is a way to use the cache without growing the Spark plan, I would like to try it out. Or, if there is another way to keep track of the computed data within the Spark program, that would also help.
You can use the Dataset.checkpoint() method at the end of each iteration.
It will persist your current RDD state and truncate the plan.
You need to manually set a checkpoint directory beforehand using SparkContext.setCheckpointDir(), since it's unset by default.
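A minimal sketch of that pattern, assuming an iterative DataFrame job; the checkpoint path, input path, and the transform function are illustrative placeholders.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("iterative-job").getOrCreate()
// Checkpoint files must live on reliable storage such as HDFS or S3.
spark.sparkContext.setCheckpointDir("s3://somebucket/checkpoints/")

// Stand-in for the real per-iteration logic.
def transform(df: DataFrame, i: Int): DataFrame = df.withColumn(s"step_$i", lit(i))

var current: DataFrame = spark.read.parquet("s3://somebucket/input/")
for (i <- 1 to 10) {
  val next = transform(current, i)
  // checkpoint() materializes the result and truncates the lineage/plan,
  // so the plan does not keep growing across iterations.
  current = next.checkpoint()
}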
Alternatively, you can increase spark.driver.memory to a value that is larger than the size of the plan.

Efficiently perform COUNT DISTINCT with spark, on csvs?

I have a large volume of data, and I'm looking to efficiently (i.e., using a relatively small Spark cluster) perform COUNT and DISTINCT operations on one of the columns.
If I do what seems obvious, i.e. load the data into a DataFrame:
df = spark.read.format("CSV").load("s3://somebucket/loadsofcsvdata/*").toDF()
df.createOrReplaceTempView("someview")
and then attempt to run a query:
domains = sqlContext.sql("""SELECT domain, COUNT(id) FROM someview GROUP BY domain""")
domains.show(1000)
my cluster just crashes and burns - throwing out of memory exceptions or otherwise hanging/crashing/not completing the operation.
I'm guessing that somewhere along the way there's some sort of join that blows one of the executors' memory?
What's the ideal method for performing an operation like this, when the source data is at massive scale and the target data isn't (the list of domains in the above query is relatively short, and should easily fit in memory)?
Related info is available in this question: What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?
I would suggest tuning your executor settings. In particular, setting the following parameters correctly can provide a dramatic improvement in performance:
spark.executor.instances
spark.executor.memory
spark.yarn.executor.memoryOverhead
spark.executor.cores
In your case, I would also suggest tuning the number of partitions; in particular, bump up the following parameter from its default of 200 to a higher value, as required (a configuration sketch follows):
spark.sql.shuffle.partitions
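For example, a sketch of passing those settings when building the session; the numbers are placeholders and need to be sized to your cluster.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("count-distinct-job")
  .config("spark.executor.instances", "10")       // placeholder values; size to your cluster
  .config("spark.executor.memory", "8g")
  .config("spark.yarn.executor.memoryOverhead", "1024")
  .config("spark.executor.cores", "4")
  .config("spark.sql.shuffle.partitions", "800")  // raised from the default of 200
  .getOrCreate()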

Improving write performance in Hive

I am performing various calculations (using UDFs) on Hive. The computations are fast enough, but I am hitting a roadblock with write performance in Hive. My result set is close to ten million records, and it takes a few minutes to write them to the table. I have experimented with cached tables and various file formats (ORC and RC), but haven't seen any performance improvement.
Indexes are not possible since I am using Shark. It would be great to know the suggestions from the SO community on the various methods which I can try to improve the write performance.
Thanks,
TM
I don't really use Shark since it is deprecated, but I believe it has the ability to read and write Parquet files just like Spark SQL. In Spark SQL it is trivial (from the website):
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a JavaSchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")
Basically, Parquet is your best bet for improving I/O speed without considering another framework (Impala is supposed to be extremely fast, but its queries are more limited). This is because, if you have a table with many rows, Parquet lets you deserialize only the needed columns, since the data is stored in a columnar format. In addition, that deserialization may be faster than with row-oriented storage, since storing data of the same type next to each other in memory can offer better compression rates. Also, as I said in my comments, it would be a good idea to upgrade to Spark SQL, since Shark is no longer being supported, and I don't believe there is much difference in terms of syntax.
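If you do move to Spark SQL, a sketch of the same Parquet round trip with the current DataFrame API would look like this; the table and path names here are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-io").enableHiveSupport().getOrCreate()

// Write the result set as Parquet; the schema travels with the files.
val result = spark.table("my_result_table")
result.write.mode("overwrite").parquet("/warehouse/results.parquet")

// Reading it back only needs to deserialize the columns a query touches.
val reloaded = spark.read.parquet("/warehouse/results.parquet")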

How do you improve performance on a pig job that has very skewed data?

I am running a pig script that performs a GROUP BY and a nested FOREACH that takes hours to run due to one or two reduce tasks. For example:
B = GROUP A BY (fld1, fld2) PARALLEL 50;
C = FOREACH B {
    U = A.fld1;
    DIST = DISTINCT U;
    GENERATE FLATTEN(group), COUNT_STAR(DIST);
}
Upon examining the counters for the slow tasks, I realized that the two reducers are processing a lot more data than the other tasks. Basically, my understanding is that the data is very skewed, so the "slow" tasks are in fact doing more work than the fast ones. I'm just wondering how to improve performance. I hate increasing the parallelism just to try to split up the work, but is that the only way?
The first option is to use a custom partitioner. Check out the documentation on GROUP for more info (see PARTITION BY, specifically). Unfortunately, you will probably have to write your own custom partitioner here. In your custom partitioner, send the first huge set of keys to reducer 0, send the next set to reducer 1, then do standard hash partitioning across what's left. This lets one reducer handle each of the big key groups exclusively, while the others each get multiple sets of keys (a sketch is shown below). This doesn't always solve the problem of bad skew, though.
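A sketch of what such a partitioner could look like, written here in Scala against the Hadoop Partitioner API. The hot key values are hypothetical placeholders, and it assumes PARALLEL is larger than the number of hot keys.
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.Partitioner
import org.apache.pig.impl.io.PigNullableWritable

class SkewAwarePartitioner extends Partitioner[PigNullableWritable, Writable] {
  // Hypothetical hot keys; replace with the grouped values that actually dominate.
  private val hotKeys = Vector("HUGE_KEY_1", "HUGE_KEY_2")

  override def getPartition(key: PigNullableWritable,
                            value: Writable,
                            numPartitions: Int): Int = {
    val k = String.valueOf(key.getValueAsPigType)
    val hot = hotKeys.indexOf(k)
    if (hot >= 0) hot % numPartitions // each hot key gets a reducer to itself
    else hotKeys.size + (k.hashCode & Integer.MAX_VALUE) % (numPartitions - hotKeys.size) // hash the rest
  }
}
With the compiled class on the job's classpath, it would be referenced from the GROUP statement, e.g. B = GROUP A BY (fld1, fld2) PARTITION BY com.example.SkewAwarePartitioner PARALLEL 50;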
How valuable is the count for those two huge sets of data? I often see huge skew when the key is something like NULL or an empty string. If those keys aren't that valuable, filter them out before the GROUP BY.