gathering a large dataframe back into master in dask distributed - dataframe

I have a large (~180K row) dataframe for which
df.compute()
hangs when running dask with the distributed scheduler in local mode on an
AWS m5.12xlarge (98 cores).
All the worker remain nearly idle
However
df.head(df.shape[0].compute(), -1)
completes quickly, with good utilization of the available core.
Logically the above should be equivalent. What causes the difference?
Is there some parameter I should pass to compute in the first version to speed it up?

When you call .compute() you are asking for the entire result in your local process as a pandas dataframe. If that result is large then it might not fit. Do you need the entire result locally? If not then perhaps you wanted .persist() instead?

Related

why some notes in spark works very slow? and why multiple execution in same situation has different execution time?

My question is about the execution time of pyspark codes in zeppelin.
I have some notes and I work with some SQL's in it. in one of my notes, I convert my dataframe to panda with .topandas() function. size of my data is about 600 megabyte.
my problem is that it takes a long time.
if I use sampling for example like this:
df.sample(False, 0.7).toPandas()
it works correctly and in an acceptable time.
the other strange point is when I run this note several times, sometimes it works fast and sometimes slow. for example for the first run after restart pyspark interpreter, it works faster.
how I can work with zeppelin in a stable state?
and which parameters are effective to run a spark code in an acceptable time?
The problem here is not zeppelin, but you as a programmer. Spark is a distributed (cluster computing) data analysis engine written in Scala which therefore runs in a JVM. Pyspark is the Python API for Spark which utilises the Py4j library to provide an interface for JVM objects.
Methods like .toPandas() or .collect() return a python object which is not just an interface for JVM objects (i.e. it actually contains your data). They are costly because they require to transfer your (distributed) data from the JVM to the python interpreter inside the spark driver. Therefore you should only use it when the resulting data is small and work as long as possible with pyspark dataframes.
Your other issue regarding different execution times needs to be discussed with your cluster admin. Network spikes and jobs submitted by other users can influence your execution time heavily. I am also surprised that your first run after a restart of the spark interpreter is faster, because during the first run the sparkcontext is created and cluster ressources are allocated which adds some overhead.

parallel execution of dask `DataFrame.set_index()`

I am trying to create an index on a large dask dataframe. No matter what scheduler I am unable to utilize more than the equivalent of one core for the operation. The code is:
(ddf.
.read_parquet(pq_in)
.set_index('title', drop=True, npartitions='auto', shuffle='disk', compute=False)
.to_parquet(pq_out, engine='fastparquet', object_encoding='json', write_index=True, compute=False)
.compute(scheduler=my_scheduler)
)
I am running this on a single 64-core machine. What can I do to utilize more cores? Or is set_index inherently sequential?
That should use multiple cores, though using disk for shuffling may introduce other bottlenecks like your local hard drive. Often you aren't bound by additional CPU cores.
In your situation I would use the distributed scheduler on a single machine so that you can use the diagnostic dashboard to get more insight about your computation.

Do we use Spark because it's faster or because it can handle large amount of data? [duplicate]

A Spark newbie here.
I recently started playing around with Spark on my local machine on two cores by using the command:
pyspark --master local[2]
I have a 393Mb text file which has almost a million rows. I wanted to perform some data manipulation operation. I am using the built-in dataframe functions of PySpark to perform simple operations like groupBy, sum, max, stddev.
However, when I do the exact same operations in pandas on the exact same dataset, pandas seems to defeat pyspark by a huge margin in terms of latency.
I was wondering what could be a possible reason for this. I have a couple of thoughts.
Do built-in functions do the process of serialization/de-serialization inefficiently? If yes, what are the alternatives to them?
Is the data set too small that it cannot outrun the overhead cost of the underlying JVM on which spark runs?
Thanks for looking. Much appreciated.
Because:
Apache Spark is a complex framework designed to distribute processing across hundreds of nodes, while ensuring correctness and fault tolerance. Each of these properties has significant cost.
Because purely in-memory in-core processing (Pandas) is orders of magnitude faster than disk and network (even local) I/O (Spark).
Because parallelism (and distributed processing) add significant overhead, and even with optimal (embarrassingly parallel workload) does not guarantee any performance improvements.
Because local mode is not designed for performance. It is used for testing.
Last but not least - 2 cores running on 393MB is not enough to see any performance improvements, and single node doesn't provide any opportunity for distribution
Also Spark: Inconsistent performance number in scaling number of cores, Why is pyspark so much slower in finding the max of a column?, Why does my Spark run slower than pure Python? Performance comparison
You can go on like this for a long time...

Persist slower than non-persist calls

My settings are: Spark 2.1 on a 3 node YARN cluster with 160 GB, 48 vcores.
Dynamic allocation turned on.
spark.executor.memory=6G, spark.executor.cores=6
First, I am reading hive tables: orders (329MB) and lineitems (1.43GB) and
doing a left outer join.
Next, I apply 7 different filter conditions based on the joined
dataset (something like var line1 = joinedDf.filter("linenumber=1"), var line2 = joinedDf.filter("l_linenumber=2"), etc).
Because I'm filtering on the joined dataset multiple times, I thought doing a persist (MEMORY_ONLY) would help here as the joined dataset will fits fully in memory.
I noticed that with persist, the Spark application takes longer to run than without persist (3.5 mins vs 3.3 mins). With persist, the DAG shows that a single stage was created for persist and other downstream jobs are waiting for the persist to complete.
Does that mean persist is a blocking call? Or do stages in other jobs start processing when persisted blocks become available?
In the non-persist case, different jobs are creating different stages to read the same data. Data is read multiple times in different stages, but this is still is turning out to be faster than the persist case.
With larger data sets, persist actually causes executors to run out of
memory (Java heap space). Without persist, the Spark jobs complete just fine. I looked at some other suggestions here: Spark java.lang.OutOfMemoryError: Java heap space.
I tried increasing/decreasing executor cores, persisting
with disk only, increasing partitions, modifying the storage ratio, but nothing seems to help with executor memory issues.
I would appreciate it if someone could mention how persist works, in what cases it is faster than not-persisting and more importantly, how to go about troubleshooting out of memory issues.
I'd recommend reading up on the difference between transformations and actions in spark. I must admit that I've been bitten by this myself on multiple occasions.
Data in spark is evaluated lazily, which essentially means nothing happens until an "action" is performed. The .filter() function is a transformation, so nothing actually happens when your code reaches that point, except to add a section to the transformation pipeline. A call to .persist() behaves in the same way.
If your code downstream of the .persist() call has multiple actions that can be triggered simultaneously, then it's quite likely that you are actually "persisting" the data for each action separately, and eating up memory (The "Storage' tab in the Spark UI will tell you the % cached of the dataset, if it's more than 100% cached, then you are seeing what I describe here). Worse, you may never actually be using cached data.
Generally, if you have a point in code where the data set forks into two separate transformation pipelines (each of the separate .filter()s in your example), a .persist() is a good idea to prevent multiple readings of your data source, and/or to save the result of an expensive transformation pipeline before the fork.
Many times it's a good idea to trigger a single action right after the .persist() call (before the data forks) to ensure that later actions (which may run simultaneously) read from the persisted cache, rather than evaluate (and uselessly cache) the data independently.
TL;DR:
Do a joinedDF.count() after your .persist(), but before your .filter()s.

Optimizing write performance of a 3 Node 8 Core/16G Cassandra cluster

We have setup a 3 node performance cluster with 16G RAM and 8 Cores each. Our use case is to write 1 million rows to a single table with 101 columns which is currently taking 57-58 mins for the write operation. What should be our first steps towards optimizing the write performance on our cluster?
The first thing I would do is look at the application that is performing the writes:
What language is the application written in and what driver is it using? Some drivers can offer better inherent performance than others. i.e. Python, Ruby, and Node.js drivers may only make use of one thread, so running multiple instances of your application (1 per core) may be something to consider. Your question is tagged 'spark-cassandra-connector' so that possibly indicates your are using that, which uses the datastax java driver, which should perform well as a single instance.
Are your writes asynchronous or are you writing data one at a time? How many writes does it execute concurrently? Too many concurrent writes could cause pressure in Cassandra, but not very many concurrent writes could reduce throughput. If you are using the spark connector are you using saveToCassandra/saveAsCassandraTable or something else?
Are you using batching? If you are, how many rows are you inserting/updating per batch? Too many rows could put a lot of pressure on cassandra. Additionally, are all of your inserts/updates going to the same partition within a batch? If they aren't in the same partition, you should consider batching them up.
Spark Connector Specific: You can tune the write settings, like batch size, batch level (i.e. by partition or by replica set), write throughput in mb per core, etc. You can see all these settings here.
The second thing I would look at is look at metrics on the cassandra side on each individual node.
What does the garbage collection metrics look like? You can enable GC logs by uncommenting lines in conf/cassandra-env.sh (As shown here). Are Your Garbage Collection Logs Speaking to You?. You may need to tune your GC settings, if you are using an 8GB heap the defaults are usually pretty good.
Do your cpu and disk utilization indicate that your systems are under heavy load? Your hardware or configuration could be constraining your capability Selecting hardware for enterprise implementations
Commands like nodetool cfhistograms and nodetool proxyhistograms will help you understand how long your requests are taking (proxyhistograms) and cfhistograms (latencies in particular) could give you insight into any other possibile disparities between how long it takes to process the request vs. perform mutation operations.