How to Let Spark Handle Bigger Data Sets?


I have a very complex query that joins 9 or more tables and includes several 'group by' expressions. Most of these tables have almost the same number of rows. The tables also have some columns that can be used as the 'key' to partition them.
Previously, the app ran fine, but the data set is now 3~4 times as large as before. My tests show that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data and stalls (no matter how I adjust the memory, partitions, executors, etc.). The actual data is probably just dozens of GB.
I would think that if the partitioning works properly, Spark shouldn't shuffle so much and the join should be done on each node. It is puzzling that Spark is not 'smart' enough to do so.
I could split the data set (using the 'key' I mentioned above) into many smaller data sets that can be processed independently. But then the burden would be on me, which defeats the very reason for using Spark. What other approaches could help?
I use Spark 2.0 on Hadoop YARN.

"My tests turned out if the row count of each table is less than 4,000,000, the application can still run pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffling"
When joining datasets, if the size of one side is below a configurable threshold, Spark broadcasts that entire table to each executor so that the join can be performed locally everywhere. Your observation above is consistent with this. You can also give Spark an explicit broadcast hint, like so: df1.join(broadcast(df2))
Other than that, can you please provide more specifics about your problem?
[Some time ago I was also grappling with joins and shuffles in one of our jobs, which had to handle a couple of TBs. We were using RDDs (not the Dataset API). I wrote about my findings [here][1]. These may be of some use if you are trying to reason about the underlying data shuffle.]
Update: According to the documentation, spark.sql.autoBroadcastJoinThreshold is the configurable property key, and 10 MB is its default value. It does the following:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.
So apparently, automatic broadcasting based on table statistics is supported only for Hive tables.
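As a rough illustration of the rule quoted above, here is a minimal PySpark-flavoured sketch. The helper `would_auto_broadcast` is hypothetical and only mimics the size check as the documentation describes it; it is not Spark's planner code. The other two functions assume an existing SparkSession and existing DataFrames, and `key` is a placeholder for your join column.

```python
# Default of spark.sql.autoBroadcastJoinThreshold: 10 MB, expressed in bytes.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def would_auto_broadcast(estimated_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Hypothetical helper mirroring the documented rule: a table is a broadcast
    candidate when its estimated size is at most the threshold; -1 disables it."""
    if threshold_bytes < 0:
        return False
    return 0 <= estimated_size_bytes <= threshold_bytes


def set_broadcast_threshold(spark, threshold_bytes):
    """Change the threshold on an existing SparkSession (pass -1 to disable)."""
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(threshold_bytes))


def join_with_broadcast_hint(large_df, small_df, key):
    """Ask Spark to broadcast small_df regardless of its size statistics."""
    from pyspark.sql.functions import broadcast  # imported lazily; needs PySpark
    return large_df.join(broadcast(small_df), on=key)
```

The explicit hint is the option that does not depend on statistics being available, which may matter if your tables are not Hive Metastore tables.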

Related

The fastest way to extract all records from Oracle

I have an Oracle table that contains 900 million records. The table is split into 24 partitions and has indexes.
I tried using a hint and set fetch_buffer to 100000:
select /*+ parallel(8) */
* from table
It takes 30 minutes to get 100 million records.
My question is:
Is there any faster way to get all 900 million records (all the data in the table)? Should I use the partitions and run 24 sequential queries? Or should I use the indexes and split my query into, for example, 10 queries?

The network is almost certainly the bottleneck here. Oracle parallelism only impacts the way the database retrieves the data, but data is still sent to the client with a single thread.
Assuming a single thread doesn't already saturate your network, you'll probably want to build a concurrent retrieval solution. It helps that the table is already partitioned, because you can then read large chunks of data without re-reading anything.
I'm not sure how to do this in Scala, but you want to run multiple queries like this at the same time, to use all the client and network resources possible:
select * from table partition (p1);
select * from table partition (p2);
...
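The idea of running one query per partition at the same time could be sketched as follows, in Python rather than Scala. This is only an outline under assumptions: `run_query` is a hypothetical caller-supplied function that opens its own connection, executes the SQL, and returns or writes out the rows; the table and partition names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_query(table, partition):
    """Build one query per partition, as in the answer above."""
    return "select * from {} partition ({})".format(table, partition)


def fetch_all_partitions(run_query, table, partitions, max_workers=8):
    """Run one query per partition concurrently.

    run_query is a caller-supplied function (hypothetical here) that opens its
    own connection, executes the SQL, and returns or stores the rows.
    Returns a dict mapping partition name to whatever run_query returned.
    """
    queries = [partition_query(table, p) for p in partitions]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_query, queries))
    return dict(zip(partitions, results))
```

Each worker should use its own connection, since the point is to have several result streams crossing the network at once.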

Not really an answer but too long for a comment.
Too many variables can affect this for me to give informed advice, so the following are just some general hints.
Is this over a network or local on the server? If the database is on a remote server, then you are paying a heavy network price. I would suggest (if possible) running the extract on the server using the BEQUEATH protocol to avoid using the network. Once the file(s) are complete, it will be quicker to compress them and transfer them to the destination than to transfer the data directly from the database to a local file via JDBC row processing.
With JDBC, remember to set the cursor fetch size (setFetchSize) to reduce round trips. The default value is tiny (10, I think); try something like 1000 to see how that helps.
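To show the effect of a larger fetch size in runnable form, here is a small Python sketch. It is an analogue, not JDBC: in Python's DB-API, `cursor.arraysize` plays a role similar to setFetchSize, and how a particular Oracle driver maps it to network round trips is something to verify for that driver. The in-memory sqlite3 database is only a stand-in so the example runs anywhere.

```python
import sqlite3


def fetch_in_batches(cursor, sql, batch_size=1000):
    """Yield lists of rows, asking the driver for batch_size rows at a time."""
    cursor.arraysize = batch_size      # DB-API counterpart of a fetch size
    cursor.execute(sql)
    while True:
        rows = cursor.fetchmany()      # defaults to cursor.arraysize rows
        if not rows:
            break
        yield rows


# Stand-in database so the sketch runs anywhere; a real extract would use an
# Oracle driver connection instead of sqlite3.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer)")
conn.executemany("insert into t values (?)", [(i,) for i in range(2500)])
batch_sizes = [len(b) for b in fetch_in_batches(conn.cursor(), "select id from t")]
```

With 2,500 rows and a batch size of 1000, the rows arrive in three batches instead of 2,500 single-row fetches.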
As for the query, you are writing to a file, so even though Oracle might process the query in parallel, your process that writes to the file probably doesn't, which makes it a bottleneck.
My approach would be to write the Java program so that it takes a range of values as command-line parameters, and then experiment to find which range size and number of concurrent instances give optimal performance. Each range will likely fall within a single partition, so you will benefit from partition pruning (assuming the range value is on an indexed column, ideally the partition key).
Roughly speaking, I would start with a range of 5m and run as many concurrent instances as the number of CPU cores minus 2. This is not a scientifically derived number, just one that I tend to use as a first stab to see what happens.
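The range-based approach above can be sketched like this (in Python here instead of Java). The 5m chunk size and the "cores minus 2" starting point come from the answer; the table name and key column passed to `range_query` are placeholders, and the key is assumed to be an integer column.

```python
import os


def split_ranges(low, high, chunk=5_000_000):
    """Split [low, high) into consecutive half-open ranges of at most chunk keys."""
    return [(start, min(start + chunk, high)) for start in range(low, high, chunk)]


def default_worker_count():
    """First-guess concurrency from the answer: CPU cores minus 2, at least 1."""
    return max(1, (os.cpu_count() or 1) - 2)


def range_query(table, key_column, start, end):
    """Query text for one range; table and key_column are placeholders."""
    return "select * from {t} where {k} >= {s} and {k} < {e}".format(
        t=table, k=key_column, s=start, e=end)
```

Each (start, end) pair would then be handed to one extractor instance, with `default_worker_count()` instances running at a time as a first experiment.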

Spark SQL .distinct() performance

I want to source a few hundred gigabytes from a database via JDBC and then process them using Spark SQL. Currently I partition that data and process it in batches of a million records. The thing is that I would also like to deduplicate my dataframes, so I was going to drop the idea of processing separate batches and instead try to process those hundreds of gigabytes as a single dataframe, partitioned accordingly.
The main concern is: how will .distinct() work in such a case? Will Spark SQL first try to load ALL the data into RAM and then apply deduplication involving many shuffles and repartitioning? Do I have to ensure that the cluster has enough RAM to hold the raw data, or will it be able to fall back on HDD storage (thus killing performance)?
Or maybe I should do it without Spark: move the data to the target storage, apply distinct counts there to detect duplicates, and get rid of them?

Spark SQL does NOT push distinct queries down to the database the way it pushes down predicates, meaning that the processing to filter out duplicate records happens at the executors rather than at the database. So your assumption that the shuffles needed to process distinct happen at the executors is correct.
In spite of this, I would still advise you to go ahead and perform the de-duplication in Spark rather than build a separate arrangement for it. My personal experience with distinct has been more than satisfactory. It has always been the joins that push my buttons.

ReduceByKeyAndWindowByCount in Spark stateful streaming aggregations

I have to inner join two relational tables extracted from Oracle.
Actually, I want to perform a 1-to-1 join to get one row per primary key, with the values from the second table aggregated into a list. So before joining the two tables 1-to-1, I have to reduce all the rows for each key to a single row with the values kept in a list.
Here is an illustration of what I need:
[Image: tables aggregation]
And here I've run into a problem: when to stop aggregating for a key and pass the aggregated entity to the next step. Spark addresses this with window intervals and watermarking for late data, so consistency is defined by the time at which the data is received. That is feasible and applicable for infinite datasets, but in my case I know the exact number of aggregations for each key. For example, for customer_id 1000 I know that there are exactly 3 products, so once I've aggregated 3 products I know that I can stop aggregating and move on to the next streaming step in my pipeline. How can this be implemented using Spark and streaming? I know there is a reduceByKeyAndWindow operation, but in my case I need something like reduceByKeyAndWindowByCount.
The count would be stored in a static dataset, or simply stored in each row as additional data.

In the end we decided to switch from streaming to core Spark with batch processing, because we have a finite dataset and that works well for our use case. We came to the conclusion that Spark Streaming was designed for processing continuous datasets (which should have been obvious from its name). That is why it only offers time-based window intervals, plus watermarks to correct for network or other delays in transit. We also found our design with counters ugly, complex and, in other words, bad. It is a live example of bad design: the growing complexity was a sign that we were moving in the wrong direction and trying to use a tool for a purpose it was not designed for.
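With a finite dataset, the aggregation no longer needs a stopping rule: every row for a key is available before the join. The following plain-Python sketch shows that batch logic only; it is not the code the authors ran. In Spark the same shape would probably be a group-by with a list-collecting aggregate followed by a join, which is an assumption on my part. The customer and product values are made up.

```python
from collections import defaultdict


def collect_by_key(rows):
    """Group (key, value) pairs into key -> list of values."""
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    return dict(grouped)


def join_one_to_one(parents, children):
    """Inner join parents (key -> name) with aggregated children (key -> list)."""
    grouped = collect_by_key(children)
    return {k: (name, grouped[k]) for k, name in parents.items() if k in grouped}


# Made-up sample data: one customer row per key, several product rows per key.
customers = {1000: "alice", 2000: "bob"}
products = [(1000, "p1"), (1000, "p2"), (3000, "p9"), (1000, "p3")]
joined = join_one_to_one(customers, products)
```

Here customer 1000 ends up with all three products in one row, and no per-key counter is involved.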

Efficiently perform COUNT DISTINCT with spark, on csvs?

I have a large volume of data, and I'm looking to efficiently (i.e., using a relatively small Spark cluster) perform COUNT and DISTINCT operations on one of the columns.
If I do what seems obvious, i.e. load the data into a dataframe:
df = spark.read.format("csv").load("s3://somebucket/loadsofcsvdata/*")
df.createOrReplaceTempView("someview")
and then attempt to run a query:
domains = spark.sql("""SELECT domain, COUNT(id) FROM someview GROUP BY domain""")
domains.show(1000)
my cluster just crashes and burns - throwing out of memory exceptions or otherwise hanging/crashing/not completing the operation.
I'm guessing that somewhere along the way there's some sort of join that blows one of the executors' memory?
What's the ideal method for performing an operation like this when the source data is at massive scale and the target data isn't (the list of domains in the above query is relatively short and should easily fit in memory)?
related info available at this question: What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?

I would suggest tuning your executor settings. In particular, setting the following parameters correctly can dramatically improve performance.
spark.executor.instances
spark.executor.memory
spark.yarn.executor.memoryOverhead
spark.executor.cores
In your case, I would also suggest tuning the number of partitions; in particular, raise the following parameter from its default of 200 to a higher value, as required.
spark.sql.shuffle.partitions
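One way to keep these settings together is to collect them in one place and render them as `--conf` arguments for spark-submit. This is a minimal sketch: the property names are the ones listed above, while the function names and all the values are illustrative assumptions, not recommendations.

```python
def executor_settings(instances, memory, memory_overhead_mb, cores, shuffle_partitions):
    """Collect the properties named above into a dict of Spark conf entries."""
    return {
        "spark.executor.instances": str(instances),
        "spark.executor.memory": memory,
        "spark.yarn.executor.memoryOverhead": str(memory_overhead_mb),
        "spark.executor.cores": str(cores),
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }


def spark_submit_args(settings):
    """Render the settings as --conf arguments for spark-submit."""
    args = []
    for key in sorted(settings):
        args += ["--conf", "{}={}".format(key, settings[key])]
    return args


# Illustrative values only; suitable numbers depend on the cluster and data.
settings = executor_settings(10, "8g", 1024, 4, 800)
submit_args = spark_submit_args(settings)
```

The resulting list can be spliced into a spark-submit command line, which makes it easy to vary one value at a time while testing.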

BigQuery range decorator duplicate issue

We are facing issues with BigQuery range decorators on a streaming table: the range decorator queries return duplicate data.
My case:
My BQ table regularly receives data from customer events through streaming inserts. Another job periodically fetches time-bound data from the table using a range decorator and sends it to Dataflow jobs, like this:
The first time, I fetch all the data from the table using
SELECT * FROM [project_id:alpha.user_action@1450287482158]
When I ran this query, I got 91 records.
After 15 minutes, I ran another query based on the last interval:
SELECT * FROM [alpha.user_action@1450287482159-1450291802380]
This also gave the same result, with 91 records.
However, when I ran the first query again to cross-check:
SELECT * FROM [project_id:alpha.user_action@1450287482158]
it returned no data.
Any help on this?

First off, have you tried using streaming dataflow? That might be a better fit (though your logic is not expressible as a query). Streaming dataflow also supports Tee-ing your writes, so you can keep both raw data and aggregate results.
On to your question:
Unfortunately this is a collision of two concepts that were built concurrently and somewhat independently, thus resulting in ill-defined interactions.
Time range table decorators were designed/built in a world where only load jobs existed. As such, blocks of data are atomically committed to a table at a single point in time. Time range decorators work quite well with this, as there are clear boundaries of inclusion/exclusion, and the relationship is stable.
Streaming Ingestion + realtime query is somewhat counter to the "load job" world. BigQuery is buffering your data for some period of time, making it available for analysis, and then periodically flushing the buffers onto the table using the traditional storage means. While the data is buffered, we have "infinite" time granularity. However, once we flush the buffer onto the table, that infinite granularity is compressed into a single time, which is currently the flush time.
Thus, using time range decorators on streaming tables can unfortunately result in some unexpected behaviors, as the same data may appear in two non-overlapping time windows (once while it is buffered, and once when it is flushed).
Our recommendation if you're trying to achieve windowed queries over recent data is to do the following:
Include a system timestamp in your data.
For the table decorator timestamps, include some buffer around the actual window to account for clock skew between your clock and Google's, and late arrivals from retry. This buffer should be both before and after your target window.
Modify your query to apply your actual time window.
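The three steps above could be combined into a query builder along these lines. This is a sketch under assumptions: it uses the `@start-end` range decorator syntax of BigQuery legacy SQL, `ts_column` is a hypothetical column holding your own system timestamp in milliseconds since the epoch, and the default buffer of ten minutes is an arbitrary illustrative value.

```python
def windowed_query(table, ts_column, start_ms, end_ms, buffer_ms=600000):
    """Build a legacy SQL query whose range decorator is wider than the target
    window, with the exact window enforced on a timestamp stored in the data.

    table is e.g. "alpha.user_action"; ts_column is a hypothetical column that
    holds the event time in milliseconds since the epoch.
    """
    decorator_start = start_ms - buffer_ms   # buffer before the window
    decorator_end = end_ms + buffer_ms       # buffer after the window
    return (
        "SELECT * FROM [{table}@{ds}-{de}] "
        "WHERE {col} >= {start} AND {col} < {end}"
    ).format(table=table, ds=decorator_start, de=decorator_end,
             col=ts_column, start=start_ms, end=end_ms)
```

The decorator only narrows how much of the table is scanned; the WHERE clause on your own timestamp is what decides which rows belong to the window.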
It should be noted that depending on your actual usage purpose, this may not address your problems. If you can give more details, there might be a way to achieve what you need.
Sorry for the inconvenience.
