Apache Ignite + Spark Dataframes: Client vs Server Doubts - ignite

I've been trying to integrate ignite and spark. The goal of my application is to write and read spark dataframes to/from ignite. However, I'm facing several issues with larger datasets (> 200 000 000 rows).
I have a 6-node Ignite cluster running on YARN. It has 160Gb of memory and 12 cores. I am trying to save the dataframe using spark (around 20Gb of raw text data) in an Ignite cache (partitioned 1 backup):
def main(args: Array[String]) {
val ignite = setupIgnite
closeAfter(ignite) { _ ⇒
implicit val spark: SparkSession = SparkSession.builder
.appName("Ignite Benchmark")
.getOrCreate()
val customer = readDF("csv", "|", Schemas.customerSchema, "hdfs://master.local:8020/apps/hive/warehouse/ssbplus100/customer")
val part = readDF("csv", "|", Schemas.partSchema, "hdfs:// master.local:8020/apps/hive/warehouse/ssbplus100/part")
val supplier = readDF("csv", "|", Schemas.supplierSchema, "hdfs:// master.local:8020/apps/hive/warehouse/ssbplus100/supplier")
val dateDim = readDF("csv", "|", Schemas.dateDimSchema, "hdfs:// master.local:8020/apps/hive/warehouse/ssbplus100/date_dim")
val lineorder = readDF("csv", "|", Schemas.lineorderSchema, "hdfs:// master.local:8020/apps/hive/warehouse/ssbplus100/lineorder")
writeDF(customer, "customer", List("custkey"), TEMPLATES.REPLICATED)
writeDF(part, "part", List("partkey"), TEMPLATES.REPLICATED)
writeDF(supplier, "supplier", List("suppkey"), TEMPLATES.REPLICATED)
writeDF(dateDim, "date_dim", List("datekey"), TEMPLATES.REPLICATED)
writeDF(lineorder.limit(200000000), "lineorder", List("orderkey, linenumber"), TEMPLATES.NO_BACKUP)
}
}
At some point, the spark application retrieves this error:
class org.apache.ignite.internal.mem.IgniteOutOfMemoryException: Out of memory in data region [name=default, initSize=256.0 MiB, maxSize=12.6 GiB, persistenceEnabled=false] Try the following:
^-- Increase maximum off-heap memory size (DataRegionConfiguration.maxSize)
^-- Enable Ignite persistence (DataRegionConfiguration.persistenceEnabled)
^-- Enable eviction or expiration policies
at org.apache.ignite.internal.pagemem.impl.PageMemoryNoStoreImpl.allocatePage(PageMemoryNoStoreImpl.java:304)
at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.allocateDataPage(AbstractFreeList.java:463)
at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.insertDataRow(AbstractFreeList.java:501)
at org.apache.ignite.internal.processors.cache.persistence.RowStore.addRow(RowStore.java:97)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.createRow(IgniteCacheOffheapManagerImpl.java:1302)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4426)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4371)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3083)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$6200(BPlusTree.java:2977)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1726)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1703)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1703)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1610)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1249)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:352)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:3602)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:2774)
at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$IsolatedUpdater.receive(DataStreamerImpl.java:2125)
at org.apache.ignite.internal.processors.datastreamer.DataStreamerUpdateJob.call(DataStreamerUpdateJob.java:140)
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.localUpdate(DataStreamProcessor.java:400)
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.processRequest(DataStreamProcessor.java:305)
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.access$000(DataStreamProcessor.java:60)
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor$1.onMessage(DataStreamProcessor.java:90)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:125)
at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1091)
at org.apache.ignite.internal.util.StripedExecutor$Stripe.run(StripedExecutor.java:511)
at java.lang.Thread.run(Thread.java:748)
I think the problem lies in the ignite server being initiated before the spark session, as in the official ignite examples. This server starts caching data that I am writing to the ignite cache and exceeds its default region size max (12Gb, which is different from the 20GB I defined for my yarn cluster). However, I don’t understand how the examples and documentation tells us to create an ignite server before the spark context (and session I assume). I understand that without this the application will hang once all the spark jobs are terminated, but I don’t understand the logic of having a server on the spark application that starts caching data. I’m very confused by this concept, and for now I have setup this ignite instance inside spark to be a client.
This is a strange behavior as all my ignite nodes (running on YARN) have 20GB defined for the default region (I changed it and verified it). This indicates me that the error must come from the ignite servers started on Spark (I think it is one on the driver and one per worker), as I did not changed the default region size in the ignite-config.xml of the spark application (defaults to 12GB as the error demonstrates). However, does this make sense? Should Spark throw out this error being its only goal to read and write data from/to ignite? Is Spark participating in caching any data and does this mean that I should set client mode in the ignite-config.xml of my application, despite the fact that the official examples are not using client mode?
Best regards,
Carlos

First, the Spark-Ignite connector already connects in client mode.
I'm going to assume that you have enough memory, but you can follow the example in the Capacity Planning guide to be sure.
However, I think the problem is that you're following the sample application a bit too closely(!). The sample -- so as to be self-contained -- includes both a server and a Spark client. If you already have an Ignite cluster, you don't need to start a server in your Spark client.
This is a slightly hacked down example from a real application (in Java, sorry):
try (SparkSession spark = SparkSession
.builder()
.appName("AppName")
.master(sparkMaster)
.config("spark.executor.extraClassPath", igniteClassPath())
.getOrCreate()) {
// Get source DataFrame
DataSet<Row> results = ....
results.write()
.outputMode("append")
.format(IgniteDataFrameSettings.FORMAT_IGNITE())
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), igniteCfgFile)
.option(IgniteDataFrameSettings.OPTION_TABLE(), "Results")
.option(IgniteDataFrameSettings.OPTION_STREAMER_ALLOW_OVERWRITE(), true)
.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "name")
.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "backups=1")
.write();
}
I didn't test, but you should get the idea: you need to provide a URL to an Ignite configuration file; it creates the client to connect to that server behind the scenes.

Related

Fetching results from Celery backend is abnormally slow

I'm using Celery with a Redis broker to do some "heavy" processing for my Django app. Everything is running locally in Docker containers on WSL2.
The tasks output a JSON which is roughly 2.5 Mb large and it takes up to 9 seconds to retrieve the result via get() in the Django app. For smaller payloads, the time goes down
I tried increasing the RAM and the CPU for WSL2 up to 6 CPUs and 8Gb RAM. Celery was configured with --max-memory-per-child=1024000 --concurrency=4
I've tried using different result_backend configuration with similar results:
Redis
RPC
SQLite with SQLAlchemy
I tried setting an interval when using SQLite (doesn't matter for RPC & Redis) with a 0.5sec improvement get(interval=0.01)
I also tried changing the result_serializer from JSON to pickle for poorer performance. But I don't think the serializer is the culprit here as serializing / deserializing the same JSON is pretty fast in console
>>> timeit.timeit(lambda: pickle.dumps(big_dict,0), number=10)
0.567067899999528
>>> timeit.timeit(lambda: pickle.loads(str), number=10)
0.3542163999991317
I tried using compression, only zlib seemed to provide a small gain.
I'm not too familiar with this setup but IMHO I should be able to retrieve results faster. The best I could achieve was 6sec. Any idea how to improve this or how to explain it ?
settings.py
CELERY_BROKER_URL = "redis://{host}:{port}/{db}".format(
host=os.environ.get('REDIS_HOST'),
port=os.environ.get('REDIS_PORT'),
db=os.environ.get('CELERY_REDIS_DB')
)
CELERY_RESULT_BACKEND = "redis://{host}:{port}/{db}".format(
host=os.environ.get('REDIS_HOST'),
port=os.environ.get('REDIS_PORT'),
db=os.environ.get('CELERY_REDIS_DB')
)
# CELERY_RESULT_BACKEND = 'db+sqlite:///celery.sqlite' # SQL Example (need SQLAlchemy==1.4.29 in requirements.txt)
# CELERY_RESULT_BACKEND = 'rpc://localhost' # RPC Example
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
Thanks
In general, Redis has a reputation for being bad at dealing with large objects and is not generally intended to be a large object store. You're better off using a general purpose RDBMS or a file store and returning a key to where the JSON can be retrieved.

Why Apache Ignite Cache.replace-K-V-V api call performing slow?

We are running Ignite cluster with 12 nodes running on Ignite 2.7.0 on openjdk
1.8 at RHEL platform.
Seeing heavy cputime spent with https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/IgniteCache.html#replace-K-V-V-
We are witnessing slowness with one of our process and when we tried to drill it
further by profiling the JVM, the main culprit (taking ~78% of total time)
seems to be coming from Ignite cache.repalce(K,V,V) api call.
Out of 77.9 by replace, 39% is taken by GridCacheAdapater.equalVal and 38.5%
by GridCacheAdapter.put
Cache is Partitioned and ATOMIC with readThrough,writeThrough,writeBehindEnabled set to True.
Attaching the profiling snapshot of one node(similar is the profiling result on other nodes), Can someone please check and suggest what
could be the cause OR some known performance issue with this Ignite version related to cache.replace(k,v,v) api ?
JVM Prolfiling Snapshot of one node
I guess that it can be related to next issue:
https://issues.apache.org/jira/browse/IGNITE-5003
The problem there related to the operations for the same key before the previous batch of updates (that contains this key) will be stored in the database.
As I see it should be added to Ignite 2.8.
Update:
I tested putAll operation. From the next two pictures you can see that putAll waiting for GridCacheWriteBehindStore.write (two different threads) that contains updateCache:
public void write(Entry<? extends K, ? extends V> entry) {
try {
if (log.isDebugEnabled())
log.debug(S.toString("Store put",
"key", entry.getKey(), true,
"val", entry.getValue(), true));
updateCache(entry.getKey(), entry, StoreOperation.PUT);
}
And provided issue can affect your put operations (or replace as well).

Apache Ignite - Distibuted Queue and Executors

I am planning to use Apache Ignite Distributed Queue.
I am using Ignite with a spring boot application. So, on bootup, I will be adding 20 names in a queue. But, since there are 3 servers in a cluster, the same 20 names gets added 3 times. But, i want to add them only once in the queue.
Ignite ignite = Ignition.ignite();
IgniteQueue<String> queue = ignite.queue(
"queueName", // Queue name.
0, // Queue capacity. 0 for unbounded queue.
null // Collection configuration.
);
Distributed executors, will be able to poll from the queue and run the task. Here, the executor is expected to poll, run the task and then add the same name to the queue. Trying to achieve round robin here.
Only one executor should be running the same task at any point of time, though there are multiple servers in a cluster.
Any suggestion for this.
You can launch ignite cluster singleton service https://apacheignite.readme.io/docs/cluster-singletons which will fill data to queue. Also you can adding data from coordinator node (oldest node in cluster) ignite.cluster().forOldest().node().isLocal()
I fixed bootup time duplicate cache loading issue this way:
final IgniteAtomicLong cacheLoadCnt = ignite.atomicLong(cacheName + "Cnt", 0, true);
if (cacheLoadCnt.get() == 0) {
loadCache();
cacheLoadCnt.addAndGet(1);
}

How to initiate warming when a cache is created?

I'd like to to use an Ignite cluster to warm a PARTITIONED cache from an existing database. The existing database is not partitioned and expensive to scan, so I'd like to perform a single scan when the cache is created by the cluster. Once the job completes, the result would be a cache containing all data from the existing database partitioned and evenly distributed across the cluster.
How do you implement a job that runs when a cache is created by Ignite?
Ignite integrates with underlying stores via CacheStore [1] implementations. Refer to [2] for details about your particular use case.
[1] https://apacheignite.readme.io/docs/persistent-store
[2] https://apacheignite.readme.io/docs/data-loading
You can create a Service that runs once on cluster start and then cancels itself. It can use a cache to store state, so it will not run if it's deployed in the cluster a second time.
The following abstract Service runs executeOnce once per cluster the first time it's deployed after cluster start:
abstract class ExecuteOnceService extends Service {
val ExecuteOnceCacheName = "_execute_once_service"
val config = new CacheConfiguration[String, java.lang.Boolean](ExecuteOnceCacheName)
.setCacheMode(CacheMode.PARTITIONED)
.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
#IgniteInstanceResource
var ignite: Ignite = _
override def execute(ctx: ServiceContext): Unit = {
val cache = ignite.getOrCreateCache(config)
val executed = cache.getAndPutIfAbsent(ctx.name(), java.lang.Boolean.TRUE)
if (executed != java.lang.Boolean.TRUE) executeOnce(ctx)
ignite.services().cancel(ctx.name())
}
def executeOnce(ctx: ServiceContext): Unit
}

Spark : Data processing using Spark for large number of files says SocketException : Read timed out

I am running Spark in standalone mode on 2 machines which have these configs
500gb memory, 4 cores, 7.5 RAM
250gb memory, 8 cores, 15 RAM
I have created a master and a slave on 8core machine, giving 7 cores to worker. I have created another slave on 4core machine with 3 worker cores. The UI shows 13.7 and 6.5 G usable RAM for 8core and 4core respectively.
Now on this I have to process an aggregate of user ratings over a period of 15 days. I am trying to do this using Pyspark
This data is stored in hourwise files in day-wise directories in an s3 bucket, every file must be around 100MB eg
s3://some_bucket/2015-04/2015-04-09/data_files_hour1
I am reading the files like this
a = sc.textFile(files, 15).coalesce(7*sc.defaultParallelism) #to restrict partitions
where files is a string of this form 's3://some_bucket/2015-04/2015-04-09/*,s3://some_bucket/2015-04/2015-04-09/*'
Then I do a series of maps and filters and persist the result
a.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to do a reduceByKey to get an aggregate score over the span of days.
b = a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to make a redis call for the actual terms for the items the user has rated, so I call mapPartitions like this
final_scores = b.mapPartitions(get_tags)
get_tags function creates a redis connection each time of invocation and calls redis and yield a (user, item, rate) tuple
(The redis hash is stored in the 4core)
I have tweaked the settings for SparkConf to be at
conf = (SparkConf().setAppName(APP_NAME).setMaster(master)
.set("spark.executor.memory", "5g")
.set("spark.akka.timeout", "10000")
.set("spark.akka.frameSize", "1000")
.set("spark.task.cpus", "5")
.set("spark.cores.max", "10")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.max.mb", "10")
.set("spark.shuffle.consolidateFiles", "True")
.set("spark.files.fetchTimeout", "500")
.set("spark.task.maxFailures", "5"))
I run the job with driver-memory of 2g in client mode, since cluster mode doesn't seem to be supported here.
The above process takes a long time for 2 days' of data (around 2.5hours) and completely gives up on 14 days'.
What needs to improve here?
Is this infrastructure insufficient in terms of RAM and cores (This is offline and can take hours, but it has got to finish in 5 hours or so)
Should I increase/decrease the number of partitions?
Redis could be slowing the system, but the number of keys is just too huge to make a one time call.
I am not sure where the task is failing, in reading the files or in reducing.
Should I not use Python given better Spark APIs in Scala, will that help with efficiency as well?
This is the exception trace
Lost task 4.1 in stage 0.0 (TID 11, <node>): java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:227)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at org.apache.http.util.EntityUtils.consume(EntityUtils.java:88)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:126)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:236)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
I could really use some help, thanks in advance
Here is what my main code looks like
def main(sc):
f=get_files()
a=sc.textFile(f, 15)
.coalesce(7*sc.defaultParallelism)
.map(lambda line: line.split(","))
.filter(len(line)>0)
.map(lambda line: (line[18], line[2], line[13], line[15])).map(scoring)
.map(lambda line: ((line[0], line[1]), line[2])).persist(StorageLevel.MEMORY_ONLY_SER)
b=a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
c=taggings.mapPartitions(get_tags)
c.saveAsTextFile("f")
a.unpersist()
b.unpersist()
The get_tags function is
def get_tags(partition):
rh = redis.Redis(host=settings['REDIS_HOST'], port=settings['REDIS_PORT'], db=0)
for element in partition:
user = element[0]
song = element[1]
rating = element[2]
tags = rh.hget(settings['REDIS_HASH'], song)
if tags:
tags = json.loads(tags)
else:
tags = scrape(song, rh)
if tags:
for tag in tags:
yield (user, tag, rating)
The get_files function is as:
def get_files():
paths = get_path_from_dates(DAYS)
base_path = 's3n://acc_key:sec_key#bucket/'
files = list()
for path in paths:
fle = base_path+path+'/file_format.*'
files.append(fle)
return ','.join(files)
The get_path_from_dates(DAYS) is
def get_path_from_dates(last):
days = list()
t = 0
while t <= last:
d = today - timedelta(days=t)
path = d.strftime('%Y-%m')+'/'+d.strftime('%Y-%m-%d')
days.append(path)
t += 1
return days
As a small optimization, I have created two separate tasks, one to read from s3 and get additive sum, second to read transformations from redis. The first tasks has high number of partitions since there are around 2300 files to read. The second one has much lesser number of partitions to prevent redis connection latency, and there is only one file to read which is on the EC2 cluster itself. This is only partial, still looking for suggestions to improve ...
I was in a similar usecase: doing coalesce on a RDD with 300,000+ partitions. The difference is that I was using s3a(SocketTimeoutException from S3AFileSystem.waitAysncCopy). Finally the issue was resolved by setting a larger fs.s3a.connection.timeout(Hadoop's core-site.xml). Hopefully you can get a clue.