parallel execution of dask `DataFrame.set_index()`

I am trying to create an index on a large dask dataframe. No matter which scheduler I use, I am unable to utilize more than the equivalent of one core for the operation. The code is:
import dask.dataframe as ddf  # pq_in, pq_out and my_scheduler are defined elsewhere

(ddf
 .read_parquet(pq_in)
 .set_index('title', drop=True, npartitions='auto', shuffle='disk', compute=False)
 .to_parquet(pq_out, engine='fastparquet', object_encoding='json', write_index=True, compute=False)
 .compute(scheduler=my_scheduler)
)
I am running this on a single 64-core machine. What can I do to utilize more cores? Or is set_index inherently sequential?

That should use multiple cores, though shuffling through disk may introduce other bottlenecks, such as your local hard drive; in that case you often aren't bound by CPU, so additional cores don't help.
In your situation I would use the distributed scheduler on a single machine so that you can use the diagnostic dashboard to get more insight into your computation.
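For example, a minimal sketch of running the same pipeline under the distributed scheduler on one machine (the worker and thread counts are placeholders; pq_in/pq_out come from the question):

import dask.dataframe as ddf
from dask.distributed import Client, LocalCluster

# Start a local "cluster" of worker processes on the single 64-core box.
cluster = LocalCluster(n_workers=8, threads_per_worker=8)
client = Client(cluster)
print(client.dashboard_link)  # open this URL to watch the task stream and per-worker CPU

df = ddf.read_parquet(pq_in)
# shuffle='disk' as in the question; with the distributed scheduler you could also try shuffle='tasks'
df = df.set_index('title', drop=True, npartitions='auto', shuffle='disk')
df.to_parquet(pq_out, engine='fastparquet', object_encoding='json', write_index=True)

Once a Client exists, dask collections use the distributed scheduler by default, and the dashboard shows whether the shuffle is CPU-bound or stuck on disk I/O.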

Related

Why do some notes in Spark run very slowly? And why do multiple executions in the same situation have different execution times?

My question is about the execution time of PySpark code in Zeppelin.
I have some notes and I work with some SQL in them. In one of my notes, I convert my dataframe to pandas with the .toPandas() function. The size of my data is about 600 megabytes.
My problem is that it takes a long time.
If I use sampling, for example like this:
df.sample(False, 0.7).toPandas()
it works correctly and in an acceptable time.
The other strange point is that when I run this note several times, it sometimes runs fast and sometimes slow. For example, the first run after restarting the PySpark interpreter is faster.
How can I work with Zeppelin in a stable state?
And which parameters are effective for running Spark code in an acceptable time?
The problem here is not Zeppelin, but you as a programmer. Spark is a distributed (cluster-computing) data analysis engine written in Scala, which therefore runs in a JVM. PySpark is the Python API for Spark and uses the Py4J library to provide an interface to JVM objects.
Methods like .toPandas() or .collect() return a Python object that is not just an interface to JVM objects (i.e. it actually contains your data). They are costly because they require transferring your (distributed) data from the JVM to the Python interpreter inside the Spark driver. Therefore you should only use them when the resulting data is small, and work as long as possible with PySpark dataframes.
Your other issue, the varying execution times, needs to be discussed with your cluster admin. Network spikes and jobs submitted by other users can influence your execution time heavily. I am also surprised that your first run after restarting the Spark interpreter is faster, because during the first run the SparkContext is created and cluster resources are allocated, which adds some overhead.
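To make the first recommendation concrete, a minimal sketch with a hypothetical table and column names (not taken from the question): do the heavy aggregation with PySpark dataframe operations so the data stays in the JVM, and only call .toPandas() on the small aggregated result.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("my_table")  # hypothetical source

# Aggregate on the executors: only a handful of rows cross the JVM/Python boundary.
summary = (df
           .groupBy("category")
           .agg(F.sum("amount").alias("total"),
                F.avg("amount").alias("mean")))

pandas_summary = summary.toPandas()  # small result, cheap to transfer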

gathering a large dataframe back into master in dask distributed

I have a large (~180K row) dataframe for which
df.compute()
hangs when running dask with the distributed scheduler in local mode on an
AWS m5.12xlarge (98 cores).
All the workers remain nearly idle.
However
df.head(df.shape[0].compute(), -1)
completes quickly, with good utilization of the available cores.
Logically the two should be equivalent. What causes the difference?
Is there some parameter I should pass to compute in the first version to speed it up?
When you call .compute() you are asking for the entire result in your local process as a pandas dataframe. If that result is large then it might not fit. Do you need the entire result locally? If not then perhaps you wanted .persist() instead?
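A minimal sketch of the .persist() pattern under the distributed scheduler (the input path and column names are placeholders, not the asker's pipeline):

import dask.dataframe as dd
from dask.distributed import Client, wait

client = Client()  # local "cluster" on the single large machine

df = dd.read_parquet("data/")   # hypothetical input
df = df[df.value > 0]           # some lazy transformations

df = df.persist()   # materialize the result on the workers, in parallel
wait(df)            # block until the work is finished

# Pull back only small things, e.g. an aggregate or the first few rows:
print(df.value.mean().compute())
print(df.head())

.persist() keeps the full result spread across worker memory, whereas .compute() funnels everything into a single pandas dataframe in the local process, which can look like idle workers while the driver gathers the data.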

Can GraphDB execute a query in parallel on multiple cores?

I noticed that my queries run faster on my local machine compared to my server because, on both machines, only one core of the CPU is being used. Is there a way to enable multi-threading so I can use 12 (or all 24) cores instead of just one?
I didn't find anything in the documentation about setting this up, but I saw that other graph databases do support it. If it is supported by default, what could cause it to use only a single core?
By default GraphDB will use all available CPU cores unless limited by the license type; the Free Edition is limited to two concurrent read operations. However, I suspect that what you are asking about is query parallelism (decomposing a single query into smaller tasks and executing them in parallel).
Write operations in GraphDB SE/EE are always split into multiple parallel tasks, so they benefit from multiple cores. GraphDB Free is limited to a single core for commercial reasons.
Read operations are always executed on a single thread, because in the general case queries run faster that way. In some specific scenarios, such as heavy aggregates over large collections, parallelizing the query execution could bring a substantial benefit, but this is currently not supported.
So, to sum up, multiple cores will only help you handle more concurrent queries, not process a single query faster. This design limitation may change in upcoming versions.
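Where the extra cores do help is concurrency. A hedged sketch of issuing several independent SPARQL queries in parallel from Python (the endpoint URL, repository name, and queries are placeholders; GraphDB's standard repository endpoint and the SPARQL HTTP protocol are assumed):

from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:7200/repositories/my-repo"  # placeholder repository

QUERIES = [
    "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }",
    "SELECT ?type (COUNT(?s) AS ?n) WHERE { ?s a ?type } GROUP BY ?type",
]

def run(query):
    # Standard SPARQL protocol request; each query still runs on one server thread.
    r = requests.get(ENDPOINT,
                     params={"query": query},
                     headers={"Accept": "application/sparql-results+json"})
    r.raise_for_status()
    return r.json()

# Independent queries run concurrently and can occupy different cores
# (license permitting), even though each individual query stays single-threaded.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run, QUERIES))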

Do we use Spark because it's faster or because it can handle large amounts of data? [duplicate]

A Spark newbie here.
I recently started playing around with Spark on my local machine on two cores by using the command:
pyspark --master local[2]
I have a 393 MB text file with almost a million rows. I wanted to perform some data manipulation operations, and I am using the built-in dataframe functions of PySpark to perform simple operations like groupBy, sum, max, stddev.
However, when I perform the exact same operations in pandas on the exact same dataset, pandas defeats PySpark by a huge margin in terms of latency.
I was wondering what could be a possible reason for this. I have a couple of thoughts.
Do the built-in functions perform serialization/deserialization inefficiently? If so, what are the alternatives?
Is the dataset too small to outrun the overhead cost of the underlying JVM on which Spark runs?
Thanks for looking. Much appreciated.
Because:
Apache Spark is a complex framework designed to distribute processing across hundreds of nodes, while ensuring correctness and fault tolerance. Each of these properties has significant cost.
Because purely in-memory in-core processing (Pandas) is orders of magnitude faster than disk and network (even local) I/O (Spark).
Because parallelism (and distributed processing) add significant overhead, and even an optimal (embarrassingly parallel) workload does not guarantee any performance improvement.
Because local mode is not designed for performance. It is used for testing.
Last but not least, 2 cores running on 393 MB is not enough to see any performance improvement, and a single node doesn't provide any opportunity for distribution.
See also: Spark: Inconsistent performance number in scaling number of cores, Why is pyspark so much slower in finding the max of a column?, Why does my Spark run slower than pure Python? Performance comparison.
You can go on like this for a long time...
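If you want to see the overhead concretely, here is a rough timing sketch along the lines of the asker's setup (the file name and column names are hypothetical):

import time

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

# pandas: in-memory, in-process, no scheduling or serialization overhead
t0 = time.time()
pdf = pd.read_csv("data.csv")
pdf.groupby("key")["value"].agg(["sum", "max", "std"])
print("pandas :", time.time() - t0)

# PySpark local mode: JVM startup, task scheduling and shuffles add fixed costs
# that a ~400 MB single-node workload cannot amortize
t0 = time.time()
sdf = spark.read.csv("data.csv", header=True, inferSchema=True)
sdf.groupBy("key").agg(F.sum("value"), F.max("value"), F.stddev("value")).collect()
print("pyspark:", time.time() - t0)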

Optimizing write performance of a 3 Node 8 Core/16G Cassandra cluster

We have set up a 3-node performance cluster with 16 GB RAM and 8 cores each. Our use case is writing 1 million rows to a single table with 101 columns, which currently takes 57-58 minutes. What should be our first steps towards optimizing write performance on our cluster?
The first thing I would do is look at the application that is performing the writes:
What language is the application written in and what driver is it using? Some drivers offer better inherent performance than others; e.g. the Python, Ruby, and Node.js drivers may only make use of one thread, so running multiple instances of your application (one per core) may be something to consider. Your question is tagged 'spark-cassandra-connector', which suggests you are using that; it uses the DataStax Java driver, which should perform well as a single instance.
Are your writes asynchronous, or are you writing data one row at a time? How many writes are executed concurrently? Too many concurrent writes can put pressure on Cassandra, but too few concurrent writes reduce throughput. If you are using the Spark connector, are you using saveToCassandra/saveAsCassandraTable or something else?
Are you using batching? If so, how many rows are you inserting/updating per batch? Too many rows per batch can put a lot of pressure on Cassandra. Additionally, are all of the inserts/updates in a batch going to the same partition? If they aren't in the same partition, you should consider not batching them.
Spark connector specific: you can tune the write settings, like batch size, batch level (i.e. by partition or by replica set), write throughput in MB per core, etc. You can see all these settings here (a sketch of them in a PySpark job follows below).
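The sketch, with placeholder contact point, keyspace, table, and values (the spark.cassandra.output.* property names are from the connector's write-tuning documentation; treat the chosen values as examples, not recommendations):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.cassandra.connection.host", "10.0.0.1")             # placeholder contact point
         .config("spark.cassandra.output.batch.grouping.key", "partition")  # group batches by partition
         .config("spark.cassandra.output.batch.size.rows", "auto")
         .config("spark.cassandra.output.concurrent.writes", "5")           # in-flight batches per task
         .config("spark.cassandra.output.throughput_mb_per_sec", "5")       # throttle per core if needed
         .getOrCreate())

df = spark.read.parquet("rows_to_load/")  # hypothetical 1M-row source

(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(keyspace="perf_test", table="wide_table")  # placeholders
   .mode("append")
   .save())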
The second thing I would do is look at metrics on the Cassandra side, on each individual node.
What do the garbage collection metrics look like? You can enable GC logs by uncommenting lines in conf/cassandra-env.sh (as shown here): Are Your Garbage Collection Logs Speaking to You?. You may need to tune your GC settings; if you are using an 8 GB heap, the defaults are usually pretty good.
Do your CPU and disk utilization indicate that your systems are under heavy load? Your hardware or configuration could be constraining your capacity (Selecting hardware for enterprise implementations).
Commands like nodetool cfhistograms and nodetool proxyhistograms will help you understand how long your requests are taking: proxyhistograms shows end-to-end request latency, while cfhistograms (the latencies in particular) can give you insight into any disparity between how long it takes to process a request and how long it takes to perform the mutation itself.