Persist slower than non-persist calls - apache-spark-sql

My settings are: Spark 2.1 on a 3 node YARN cluster with 160 GB, 48 vcores.
Dynamic allocation turned on.
spark.executor.memory=6G, spark.executor.cores=6
First, I am reading Hive tables: orders (329MB) and lineitems (1.43GB) and doing a left outer join.
Next, I apply 7 different filter conditions to the joined dataset (something like var line1 = joinedDf.filter("l_linenumber=1"), var line2 = joinedDf.filter("l_linenumber=2"), etc).
Because I'm filtering the joined dataset multiple times, I thought doing a persist (MEMORY_ONLY) would help here, as the joined dataset fits fully in memory.
I noticed that with persist, the Spark application takes longer to run than without persist (3.5 mins vs 3.3 mins). With persist, the DAG shows that a single stage was created for persist and other downstream jobs are waiting for the persist to complete.
Does that mean persist is a blocking call? Or do stages in other jobs start processing when persisted blocks become available?
In the non-persist case, different jobs create different stages to read the same data. The data is read multiple times in different stages, but this still turns out to be faster than the persist case.
With larger data sets, persist actually causes executors to run out of memory (Java heap space). Without persist, the Spark jobs complete just fine. I looked at some other suggestions here: Spark java.lang.OutOfMemoryError: Java heap space.
I tried increasing/decreasing executor cores, persisting with disk only, increasing partitions, and modifying the storage ratio, but nothing seems to help with the executor memory issues.
I would appreciate it if someone could mention how persist works, in what cases it is faster than not-persisting and more importantly, how to go about troubleshooting out of memory issues.

I'd recommend reading up on the difference between transformations and actions in Spark. I must admit that I've been bitten by this myself on multiple occasions.
Data in Spark is evaluated lazily, which essentially means nothing happens until an "action" is performed. The .filter() function is a transformation, so nothing actually happens when your code reaches that point, except that a step is added to the transformation pipeline. A call to .persist() behaves in the same way.
If your code downstream of the .persist() call has multiple actions that can be triggered simultaneously, then it's quite likely that you are actually "persisting" the data for each action separately, and eating up memory. (The "Storage" tab in the Spark UI will tell you the % cached of the dataset; if it's more than 100% cached, then you are seeing what I describe here.) Worse, you may never actually be using cached data.
Generally, if you have a point in code where the data set forks into two separate transformation pipelines (each of the separate .filter()s in your example), a .persist() is a good idea to prevent multiple readings of your data source, and/or to save the result of an expensive transformation pipeline before the fork.
Many times it's a good idea to trigger a single action right after the .persist() call (before the data forks) to ensure that later actions (which may run simultaneously) read from the persisted cache, rather than evaluate (and uselessly cache) the data independently.
TL;DR:
Do a joinedDf.count() after your .persist(), but before your .filter()s.
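To make the lazy-evaluation point concrete, here is a toy model in plain Python. It is not the Spark API: LazyDataset, expensive_join and run are made-up names for illustration, and it runs everything sequentially, so it only shows why persist() does nothing until an action runs and how one action before the fork lets the later filters reuse the cached rows.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (NOT the Spark API)."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the rows
        self._persist = False        # set by persist(); does no work itself
        self._cache = None           # filled by the first action after persist()

    def persist(self):
        # Like a transformation: only marks the dataset, nothing is computed.
        self._persist = True
        return self

    def _rows(self):
        if self._cache is not None:
            return self._cache
        rows = self._compute()
        if self._persist:
            self._cache = rows
        return rows

    def filter(self, predicate):
        # Transformation: returns a new lazy dataset, nothing runs yet.
        return LazyDataset(lambda: [r for r in self._rows() if predicate(r)])

    def count(self):
        # Action: this is what actually triggers evaluation.
        return len(self._rows())


def run(use_persist):
    reads = {"n": 0}

    def expensive_join():
        reads["n"] += 1              # count how often the source is re-read
        return [{"l_linenumber": i % 7 + 1} for i in range(70)]

    joined = LazyDataset(expensive_join)
    if use_persist:
        joined.persist()
        joined.count()               # single action before the fork fills the cache
    counts = [joined.filter(lambda r, n=n: r["l_linenumber"] == n).count()
              for n in range(1, 8)]
    return reads["n"], counts


reads_without, counts_without = run(use_persist=False)
reads_with, counts_with = run(use_persist=True)
print(reads_without, reads_with)     # source evaluations without / with persist
```

In this sketch the seven filters re-run the join seven times without persist and once with it. Whether that saves time on a real cluster still depends on how expensive the recomputation is compared with the cost of caching.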

Related

Multiple step Pandas processing with Airflow

I have a multi-stage ETL transform using pandas. Basically, I load almost 2GB of data from MongoDB and then apply several functions to the columns. My question is whether there's any way to break those transformations into multiple Airflow tasks.
The options I have considered are:
Creating a temporary table in MongoDB and loading/storing the transformed data frame between steps. I found this cumbersome and prone to unusual overhead due to disk I/O.
Passing data among the tasks using XCom. I think this is a nice solution but I worry about the sheer size of the data. The docs explicitly state:
Any object that can be pickled can be used as an XCom value, so users should make sure to use objects of appropriate size.
Using an in-memory storage between steps. Maybe saving the data in a Redis server or something, but I'm not really sure if that would be any better than just using XCom altogether.
So, do any of you have any tips on how to handle this situation? Thanks!

Control of parallelization

I am running a custom processor on a rowset that does not seem to run in parallel. The underlying ~1GB text file is first read into a table that is partitioned via round robin. The 'Extract' runs on 200 vertices but then (under the 'Aggregate' node) the processing [that does various complex computations] happens on only 2 vertices even though the parallelism parameter is much higher than that. Is there a special hint that needs to be used to tell the compiler to use more vertices? Is there a function or property that needs to be overridden to set the parallelism at this phase as well?
Sorry for the late reply. But it is vacation time :).
It is good to see that the extract phase is fully scaled out.
Without seeing the script or the generated plan it is a bit difficult to say why you only see 2 vertices in some places. There are a couple of reasons why that may be the case:
You don't have enough data to scale out to more.
Your aggregation needs more data and thus the plan has less parallelism.
Your operation is intrinsically less parallel.
The optimizer's data cardinality estimation is off and it chooses too little parallelism. We have some ability to hint, but I would rather see the job first.
Note that custom processors often block the optimizer from pushing optimizations through in the script (using the READ ONLY option for example helps) and can throw off the cardinality estimations.
If you send me the script, the job graph and the link to the job to mrys at Microsoft, I and the team will look into it next week after the holidays are over.

Optimizing write performance of a 3 Node 8 Core/16G Cassandra cluster

We have set up a 3 node performance cluster with 16G RAM and 8 cores each. Our use case is to write 1 million rows to a single table with 101 columns, which currently takes 57-58 mins. What should be our first steps towards optimizing the write performance on our cluster?
The first thing I would do is look at the application that is performing the writes:
What language is the application written in and what driver is it using? Some drivers can offer better inherent performance than others, e.g. the Python, Ruby, and Node.js drivers may only make use of one thread, so running multiple instances of your application (1 per core) may be something to consider. Your question is tagged 'spark-cassandra-connector', which possibly indicates you are using that; it uses the DataStax Java driver, which should perform well as a single instance.
Are your writes asynchronous or are you writing rows one at a time? How many writes does it execute concurrently? Too many concurrent writes could put pressure on Cassandra, but too few could reduce throughput. If you are using the Spark connector, are you using saveToCassandra/saveAsCassandraTable or something else?
Are you using batching? If you are, how many rows are you inserting/updating per batch? Too many rows could put a lot of pressure on Cassandra. Additionally, are all of your inserts/updates within a batch going to the same partition? If they aren't, you should consider grouping them so that each batch targets a single partition.
Spark Connector Specific: You can tune the write settings, like batch size, batch level (i.e. by partition or by replica set), write throughput in MB per core, etc. You can see all these settings here.
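As a rough illustration of the batching point above, this is a minimal Python sketch of grouping rows so that each batch only touches one partition and stays under a size limit. It does not talk to Cassandra or use any driver API; batches_by_partition, the sensor_id key and the batch size of 3 are all made up for the example.

```python
from collections import defaultdict


def batches_by_partition(rows, partition_key, max_batch_size):
    """Yield lists of rows where every row in a list shares one partition key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[partition_key]].append(row)
    for same_partition_rows in grouped.values():
        # Split large partitions so no single batch grows too big.
        for start in range(0, len(same_partition_rows), max_batch_size):
            yield same_partition_rows[start:start + max_batch_size]


# Hypothetical rows: "sensor_id" plays the role of the partition key.
rows = [{"sensor_id": i % 3, "reading": i} for i in range(10)]
batches = list(batches_by_partition(rows, "sensor_id", max_batch_size=3))
for batch in batches:
    print(batch[0]["sensor_id"], [r["reading"] for r in batch])
```

Each yielded list would then be sent as one batch by whatever driver you use. The right batch size is something to measure rather than assume.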
The second thing I would look at is the metrics on the Cassandra side on each individual node.
What do the garbage collection metrics look like? You can enable GC logs by uncommenting lines in conf/cassandra-env.sh (as shown here); see also Are Your Garbage Collection Logs Speaking to You?. You may need to tune your GC settings, although if you are using an 8GB heap the defaults are usually pretty good.
Do your CPU and disk utilization indicate that your systems are under heavy load? Your hardware or configuration could be constraining your capability (see Selecting hardware for enterprise implementations).
Commands like nodetool cfhistograms and nodetool proxyhistograms will help you understand how long your requests are taking. Comparing proxyhistograms with cfhistograms (latencies in particular) could give you insight into any disparities between how long it takes to process a request and how long it takes to perform the mutation operations.

Fine tuning oracle query with pipelined function

I have a query (that powers an Oracle Application Express Report) that I was told by our users was executing "slowly" or at an unacceptable speed (wasn't given an actual load time for the page and the query is the only thing on the page).
The query involves many tables and actually references a pipelined function which identifies the currently logged-in users to our website and returns a custom "table" of records they have permission to based upon a custom security scheme we have.
My main question is around Oracle's caching of queries and how they could be affected by our setup.
When I took the query out of the webpage and ran it in SQL Developer (and manually specified a user ID to simulate a logged-in user to the website), the run time went from 71 seconds to 19 seconds to 0.5 seconds. Clearly, Oracle is utilizing its caching mechanism to make subsequent runs faster.
How is this affected by:
The fact that different users will get different tables from the pipelined function (all the same columns, just a different number of rows and different values in the rows). Does the pipelining prevent caching from working? Am I only seeing caching because I'm running a very isolated test?
Furthermore, is caching easily influenced by the number of people using the system? I'm not sure how "much" can get cached. Therefore, if we have 50 concurrent users accessing different parts of the website that load different queries all day long, is it likely that Oracle won't be able to cache many/any of them because it is constantly seeing different queries?
Sorry my question isn't very technical.
I'm a developer who has been asked to help out in this seemingly DBA question.
Also, this is complicated because I can't really determine what the actual load times are since our users don't report that level of detail.
Any thoughts on:
how I can determine if this query is actually slow?
what the average processing time would be?
and how to proceed with fine tuning if it is a problem?
Thanks!
It doesn't sound like this has anything to do with APEX, pipelined table functions, or query caching. It sounds like you are describing the effects of plain old data caching (most likely at the database level but potentially at the operating system and disk subsystem layers).
As a very basic overview, data is stored in rows, rows are stored in blocks (most commonly 8 KB in size), blocks are stored in extents (generally a few MB in size), and extents roll up to segments (e.g. a table). Oracle maintains a buffer cache where the most recently accessed blocks are stored. When you run a query, Oracle figures out which blocks it needs to read in order to get your data (this is the query plan). It then looks to see whether those blocks are in the buffer cache or whether they have to be read from disk. Obviously, reading a block from cache is much more efficient than reading it off the disk since RAM is much faster than disk. If you run the same query with the same set of bind variable values multiple times in a row, you'll be accessing the same set of blocks each time but more and more of the blocks you care about are going to be in the cache. So you'd generally expect that the second and third time that you call the query, you'll see faster performance.
If you run the query with a different set of bind variable values and the second set causes Oracle to access many of the same blocks, those executions will benefit from the data the prior test cached. Otherwise, you'd be back to square one, potentially reading all the data you need off disk. Most likely, you'll see some combination of the two.
Remember as well that it is not just Oracle that is caching data. Frequently, the operating system will be caching the most active pieces of the underlying Oracle data files. And the I/O subsystem will be caching the most recently accessed data as well. So even if Oracle thinks that it needs to go out to fetch a block because it is not in the database's buffer cache, the file system or the I/O subsystem may have cached that data so it may not require an actual physical read off of disk. These other caches behave similarly where running the same query multiple times in a row is likely to cause the cache to be "warm" and improve the performance of the later runs.
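The warm-cache effect described above can be sketched with a toy block cache in Python. This is a simple least-recently-used model, not Oracle's actual buffer cache algorithm, and the block ranges and capacity are invented. It only shows why a repeated query needs fewer physical reads and why a different user's query benefits only from the blocks it shares.

```python
from collections import OrderedDict


class BufferCache:
    """Toy least-recently-used block cache (NOT Oracle's actual algorithm)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.physical_reads = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # cache hit: mark as recently used
            return
        self.physical_reads += 1                # cache miss: go to "disk"
        self.blocks[block_id] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used block


def run_query(cache, block_ids):
    """Return how many physical reads this execution needed."""
    before = cache.physical_reads
    for block_id in block_ids:
        cache.read(block_id)
    return cache.physical_reads - before


cache = BufferCache(capacity=100)
user_a_blocks = range(0, 60)      # hypothetical blocks for one set of bind values
user_b_blocks = range(40, 100)    # a different user, overlapping only partly

first = run_query(cache, user_a_blocks)    # cold cache
second = run_query(cache, user_a_blocks)   # same query again, now warm
other = run_query(cache, user_b_blocks)    # different binds, partial overlap
print(first, second, other)
```

With many concurrent users the cache capacity is shared, so blocks can be evicted between runs. Lowering the capacity in the sketch shows that effect.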

updating 2 800 000 records with 4 threads

I have a VB.net application with an Access database with one table that contains about 2,800,000 records; each row is updated with new data daily. The machine has 64GB of RAM and an i7 3960X overclocked to 4.9GHz.
Note: data sources are local.
I wonder whether using ~10 threads would make updating the rows finish faster.
If it is possible, what would be the mechanism for dividing this big loop among multiple threads?
Update: Sometimes the loop has to repeat the calculation for some rows depending on the results. Also, the loop has exactly 63 conditions and is 242 lines of code.
Microsoft Access is not particularly good at handling many concurrent updates, compared to other database platforms.
The more your tasks need to do calculations, the more you will typically benefit from concurrency / threading. If you spin up 10 threads that do little more than send update commands to Access, it is unlikely to be much faster than it is with just one thread.
If you have to do any significant calculations between reading and writing data, threads may show a performance improvement.
I would suggest trying the following and measuring the result:
One thread to read data from Access
One thread to perform whatever calculations are needed on the data you read
One thread to update Access
You can implement this using a Producer / Consumer pattern, which is pretty easy to do with a BlockingCollection.
The nice thing about the Producer / Consumer pattern is that you can add more producer and/or consumer threads with minimal code changes to find the sweet spot.
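The three-stage layout suggested above could look roughly like this. It is sketched in Python, with queue.Queue standing in for .NET's BlockingCollection; the in-memory list and dict are placeholders for the Access reads and updates, and the doubling is a placeholder for the real calculation.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def reader(source_rows, to_calc):
    for row in source_rows:          # stands in for reading from the database
        to_calc.put(row)
    to_calc.put(SENTINEL)


def calculator(to_calc, to_write):
    while True:
        row = to_calc.get()
        if row is SENTINEL:
            to_write.put(SENTINEL)   # pass the end marker downstream
            break
        row_id, value = row
        to_write.put((row_id, value * 2))   # placeholder for the real calculation


def writer(to_write, store):
    while True:
        item = to_write.get()
        if item is SENTINEL:
            break
        row_id, new_value = item
        store[row_id] = new_value           # stands in for the UPDATE statement


source = [(i, i) for i in range(1000)]
store = {}
to_calc = queue.Queue(maxsize=100)   # bounded, so a fast reader cannot run far ahead
to_write = queue.Queue(maxsize=100)

threads = [
    threading.Thread(target=reader, args=(source, to_calc)),
    threading.Thread(target=calculator, args=(to_calc, to_write)),
    threading.Thread(target=writer, args=(to_write, store)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(store), store[999])
```

The sketch shows only the structure. Adding more calculator threads would also require one end marker per consumer, and whether it is any faster has to be measured on the real workload.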
Supplemental Thought
IO is probably the bottleneck of your application. Consider placing the Access file on faster storage if you can (SSD, RAID, or even a RAM disk).
Well if you're updating 2,800,000 records with 2,800,000 queries, it will definitely be slow.
Generally, it's good to avoid opening multiple connections to update your data.
You might want to show us some code of how you're currently doing it, so we could tell you what to change.
So I don't think (with the information you gave) that going multi-thread for this would be faster. Now, if you're thinking about going multi-thread because the update freezes your GUI, now that's another story.
If the processing is slow, I personally don't think it's due to your server's specs. I'd guess it has more to do with the logic you use to update the data.
Don't wonder, test. Write it so that you can dispatch as many threads as you like to do the work, and test it with various numbers of threads. What does the loop you are talking about look like?
With questions like "if I add more threads, will it work faster?" it is always best to test, though there are rules of thumb. If the DB is local, chances are that Oded is right.