Using a large dataframe in memory (data not allowed to be "at rest") results in driver crash and/or out of memory - Azure Databricks - dataframe

I'm having trouble working on Databricks with data that we are not allowed to save off or persist in any way. The data comes from an API (which returns a JSON response). We have a Scala package on our cluster that makes the queries (almost 6k of them), saves the responses to a dataframe, then explodes that dataframe into a new dataframe that we can use to get the info we need. However, all the data gets collected to the driver node. Then, when we try to run comparisons/validations/Spark code using the resulting dataframe, it won't distribute the work (it tries to do everything on the driver) and uses up all of the driver's JVM heap memory.
The two errors we get are "OutOfMemoryError: Java Heap Space" and "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."
We've got a 64GB driver (with five 32GB worker nodes) and have increased the max JVM heap memory to 25GB. But because the work won't distribute to the nodes, the driver keeps crashing with the OOM errors.
Things run fine when we do save a Parquet file of this data and run our Spark code/comparisons against that file (the work distributes fine), but, per an agreement with the data owners, we are only allowed to save data like this for development purposes (so this is currently not a viable option for production).
Things we've tried:
- Removing all displays/prints/logs
- Caching the dataframe (and the prior/subsequent dataframes it is connected to) with .cache()/.count() and spark.sql("CACHE TABLE <tablename>") (sketched after this list)
- Increasing the size of the driver node to 64GB
- Increasing the JVM heap size to 25GB
- Removing unneeded columns from the dataframe
- Using .repartition(300) on the dataframe
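For reference, a minimal PySpark sketch of the caching and repartition attempts listed above (the dataframe and table names here are placeholders; the actual notebook code may differ):
exploded_df = exploded_df.repartition(300)        # try to spread the rows across the worker nodes
exploded_df.cache()                                # mark the dataframe for caching ...
exploded_df.count()                                # ... and force it to materialize
exploded_df.createOrReplaceTempView("api_data")    # hypothetical view name
spark.sql("CACHE TABLE api_data")                  # the SQL-level cache that was also tried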

Related

Does dataframe.repartition(x) make execution faster?

I have a Spark script that reads data from Amazon S3 and then writes to another bucket using Parquet format.
This is what the code looks like:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods.repartition(25).write.format("parquet").mode("OverWrite").save("AnotherLocationInS3")
My question is: how does the repartition argument (here 25) affect the execution time? Should I increase it so the script runs faster?
Second question: Would it be better if I cache my df before the last line?
Thank you
In typical setups neither repartition nor cache will help you in this specific case. Since you read data from a non-splittable format:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods will have only one partition.
In such a case repartitioning would make sense only if you performed some actual processing on this data.
However, if you just write to a distributed file system, repartitioning simply doubles the cost: you have to send the data to other nodes first (which involves serialization, deserialization, network transfer, and writes to disk) and then still write to the distributed file system.
There are of course edge cases where this makes sense. If the network connecting your cluster nodes is much faster than the network connecting your cluster to the S3 nodes, the effective latency might be a bit lower.
As for caching: there is no value in caching here at all. Caching a Dataset is expensive and makes sense only if the persisted data is reused.
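A quick way to see the single-partition behaviour described above (a sketch; the path is the question's placeholder):
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
print(df_ods.rdd.getNumPartitions())   # gzip is not splittable, so this prints 1
df_out = df_ods.repartition(25)        # only pays off if real processing follows
print(df_out.rdd.getNumPartitions())   # 25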
Answer 1: Whether to repartition to 25, more, or less depends on how much data you have and on the number of executors you provided. If your Spark code runs on a cluster with more than one executor and the data is not repartitioned, then repartitioning will speed up the job by writing your data in parallel.
Answer 2: There is no need to cache df before the last line, because you are using only a single action in your code. If you perform multiple actions on your DataFrame and don't want it to be recalculated once per action, then you should cache it.
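To illustrate the point in Answer 2, a small sketch of when caching does pay off (reusing df_ods from the question; the second output path is illustrative):
df_ods.cache()                          # only worthwhile when several actions reuse the same data
n_rows = df_ods.count()                 # first action: reads the file and populates the cache
df_ods.write.format("parquet").mode("overwrite").save("AnotherLocationInS3")   # second action: reuses the cache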
The thing here is that Spark can only parallelize the write up to a certain point, since one file can't be written by multiple executors at the same time.
Repartitioning helps with this parallelization because it will write 25 different files (one for each partition). If you increase the number of partitions, you increase the number of written files, and hence speed up the execution. This comes at a price, because the reading time will increase with the number of files to be read.
The limit is the number of executors you are running your job with; e.g. if you are running with 25 executors, then setting repartition to 26 will not help you, because to write the 26th partition one of the previous 25 would have to finish first.
For the other question, I don't think .cache() will help you because Spark is lazy; maybe this article can help you further.
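Following the reasoning about the executor limit above, a hedged sketch for picking the partition count (defaultParallelism reports the total number of cores and is used here only as a rough proxy for the available write slots):
num_parts = spark.sparkContext.defaultParallelism   # roughly one write task per available core
(df_ods.repartition(num_parts)
       .write.format("parquet")
       .mode("overwrite")
       .save("AnotherLocationInS3"))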

How can I write to one CSV file fast?

I am trying to repartition(1) the dataframe when writing to CSV, but it has been running for more than 2 hours. I tried repartition(20), but it is still very slow. I think the data is big and I am new to Spark; how can I make this faster?
df.repartition(20).write.format("com.databricks.spark.csv").option("header", "true").save(filepath)
Are you running it on your local machine or remotely?
Is it a standalone or YARN cluster, and how many machines do you have?
You can check the tasks in the Spark UI to see how many partitions there are.
Per machine you should have a minimum of 3-4 partitions; the maximum can go up to 10,000.
Rather than repartition, do coalesce(1) for just one partition; it causes less shuffle and the job will be faster.
repartition causes more shuffle.
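A minimal sketch of the coalesce(1) suggestion, reusing the write options from the question (filepath is the asker's variable):
(df.coalesce(1)                                   # merge partitions without a full shuffle
   .write.format("com.databricks.spark.csv")
   .option("header", "true")
   .save(filepath))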

Analyzing data flow of Dask dataframes

I have a dataset stored in a tab-separated text file. The file looks as follows:
date time temperature
2010-01-01 12:00:00 10.0000
...
where the temperature column contains values in degrees Celsius (°C).
I compute the daily average temperature using Dask. Here is my code:
from dask.distributed import Client
import dask.dataframe as dd
client = Client("<scheduler URL>")
inputDataFrame = dd.read_table("<input file>").drop('time', axis=1)
groupedData = inputDataFrame.groupby('date')
meanDataframe = groupedData.mean()
result = meanDataframe.compute()
result.to_csv('result.out', sep='\t')
client.close()
In order to improve the performance of my program, I would like to understand the data flow caused by Dask data frames.
How is the text file read into a data frame by read_table()? Does the client read the whole text file and send the data to the scheduler, which partitions the data and sends it to the workers? Or does each worker read the data partitions it works on directly from the text file?
When an intermediate data frame is created (e.g. by calling drop()), is the whole intermediate data frame sent back to the client and then sent to the workers for further processing?
The same question for groups: where is the data for a group object created and stored? How does it flow between client, scheduler and workers?
The reason for my question is that if I run a similar program using Pandas, the computation is roughly two times faster, and I am trying to understand what causes the overhead in Dask. Since the size of the result data frame is very small compared to the size of the input data, I suppose there is quite some overhead caused by moving the input and intermediate data between client, scheduler and workers.
1) The data are read by the workers. The client does read a little ahead of time, to figure out the column names and types and, optionally, to find line-delimiters for splitting files. Note that all workers must be able to reach the file(s) of interest, which can require some shared file-system when working on a cluster.
2), 3) In fact, the drop, groupby and mean methods do not generate intermediate data frames at all; they just accumulate a graph of operations to be executed (i.e., they are lazy). You could time these steps and see that they are fast. During execution, intermediates are made on workers, copied to other workers as required, and discarded as soon as possible. There are never copies to the scheduler or client unless you explicitly request them.
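To see this laziness for yourself, a small sketch using the variables from the question's code (the timings will of course vary):
import time

t0 = time.time()
groupedData = inputDataFrame.groupby('date')   # lazy: only extends the task graph
meanDataframe = groupedData.mean()             # still lazy
print("building the graph took", time.time() - t0, "s")   # typically milliseconds

t0 = time.time()
result = meanDataframe.compute()               # the actual work runs here, on the workers
print("compute took", time.time() - t0, "s")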
So, to the root of your question: you can best investigate the performance of your operation by looking at the dashboard.
There are many factors that govern how quickly things will progress: the processes may be sharing an IO channel; some tasks do not release the GIL, and so parallelise poorly in threads; the number of groups will greatly affect the amount of shuffling of data into groups... plus there is always some overhead for every task executed by the scheduler.
Since Pandas is efficient, it is not surprising that for the case where data fits easily into memory, it performs well compared to Dask.

Redis 1+ min query time on lists with 15 larger JSON objects (20MB total)

I use Redis to cache database inserts. For this I created a list CACHE into which I push serialized JSON lists. In pseudocode:
let entries = [{a}, {b}, {c}, ...];
redis.rpush("CACHE", JSON.stringify(entries));
The idea is to run this code for an hour, then later do an
let all = redis.lrange("CACHE", 0, LIMIT);
processAndInsert(all);
redis.ltrim("CACHE", 0, all.length);
Now the thing is that each entries list can be relatively large (but far below 512MB / whatever Redis limit I read about). Each of a, b, c is an object of probably 20 bytes, and entries itself can easily have 100k+ objects / 2MB.
My problem now is that even for very short CACHE lists of only 15 entries, a simple lrange can take many minutes(!), even from the redis-cli (my node.js process actually dies with a "FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory", but that's a side comment).
The debug output for the list looks like this:
127.0.0.1:6379> debug object "CACHE"
Value at:00007FF202F4E330 refcount:1 encoding:linkedlist serializedlength:18104464 lru:12984004 lru_seconds_idle:1078
What is happening? Why is this so massively slow, and what can I do about it? This does not seem to be a normal slowness, something seems to be fundamentally wrong.
I am using a local Redis 2.8.2101 (x64), ioredis 1.6.1, node.js 0.12 on a relatively hardcore Windows 10 gaming machine (i5, 16GB RAM, 840 EVO SSD, ...) by the way.
Redis is great at doing lots of small operations, but not so great at doing small numbers of "very big" operations.
I think you should re-evaluate your algorithm and try to break your data apart into smaller chunks. Not only will you save bandwidth, you also won't lock your Redis instance for long stretches of time.
Redis offers many data structures you should be able to use for more fine-grained control over your data.
Still, in this case, since you are running Redis locally, and assuming you are not running anything else but this code, I doubt that bandwidth or Redis itself is the problem. I'm more inclined to think this line:
JSON.stringify()
is the main culprit for the slow execution you are seeing.
JSON serialization of 20MB of data is not a simple thing: the process needs to allocate many small strings and has to go through your whole array and inspect each item individually. All of this takes a long time for a big object like this one.
Again, if you broke your data apart and did smaller operations with Redis, you wouldn't need the JSON serializer at all.
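A sketch of the chunked approach (shown in Python with redis-py for brevity; the same pattern applies with ioredis in node.js, and the chunk size is an arbitrary illustration):
import json
import redis

r = redis.Redis()

def push_entries(entries, chunk_size=1000):
    # Push many small values instead of one huge JSON string,
    # so no single command has to serialize or transfer ~20MB at once.
    for i in range(0, len(entries), chunk_size):
        chunk = entries[i:i + chunk_size]
        r.rpush("CACHE", *(json.dumps(e) for e in chunk))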

datastax: Spark job fails: Removing BlockManager with no recent heart beats

I'm using datastax-4.6. I have created a Cassandra table and stored 2 crore (20 million) records. I'm trying to read the data using Scala. The code works fine for a few records, but when I try to retrieve all 2 crore records it displays the following error.
WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 172.20.98.17, 34224, 0) with no recent heart beats: 140948ms exceeds 45000ms
15/05/15 19:34:06 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(C15759,34224) not found
Any help?
This problem is often tied to GC pressure.
Tuning your Timeouts
Increase the spark.storage.blockManagerHeartBeatMs so that Spark waits for the GC pause to end.
SPARK-734 recommends setting -Dspark.worker.timeout=30000 -Dspark.akka.timeout=30000 -Dspark.storage.blockManagerHeartBeatMs=30000 -Dspark.akka.retry.wait=30000 -Dspark.akka.frameSize=10000
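One hedged way to apply those recommendations programmatically (shown with PySpark's SparkConf; the same keys can also go into spark-defaults.conf or SPARK_JAVA_OPTS, and the values are the ones quoted above):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.worker.timeout", "30000")
        .set("spark.akka.timeout", "30000")
        .set("spark.storage.blockManagerHeartBeatMs", "30000")
        .set("spark.akka.retry.wait", "30000")
        .set("spark.akka.frameSize", "10000"))
sc = SparkContext(conf=conf)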
Tuning your jobs for your JVM
spark.cassandra.input.split.size - lets you change the level of parallelization of your Cassandra reads. Bigger split sizes mean that more data will have to reside in memory at the same time.
spark.storage.memoryFraction and spark.shuffle.memoryFraction - the amount of the heap that will be occupied by RDDs (as opposed to shuffle memory and Spark overhead). If you aren't doing any shuffles, you could increase this value. The Databricks folks say to make this similar in size to your oldgen.
spark.executor.memory - Obviously this depends on your hardware. Per Databricks you can go up to 55GB. Make sure to leave enough RAM for C* and for your OS and OS page cache. Remember that long GC pauses happen on larger heaps.
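And a sketch of wiring the job-tuning settings above into the same SparkConf (every value here is purely illustrative, not a recommendation):
conf = (conf
        .set("spark.cassandra.input.split.size", "10000")   # illustrative split size
        .set("spark.storage.memoryFraction", "0.5")          # illustrative fraction for cached RDDs
        .set("spark.shuffle.memoryFraction", "0.2")          # illustrative shuffle fraction
        .set("spark.executor.memory", "8g"))                  # depends on your hardware; leave RAM for C* and the OS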
Out of curiosity, are you frequently going to be extracting your entire C* table with Spark? What's the use case?