Efficiently perform COUNT DISTINCT with spark, on csvs? - optimization

I have a large volume of data, and I'm looking to efficiently (ie, using a relatively small Spark cluster) perform COUNT and DISTINCT operations one of the columns.
If I do what seems obvious, ie load the data into a dataframe:
df = spark.read.format("CSV").load("s3://somebucket/loadsofcsvdata/*").toDF()
df.registerView("someview")
and then attempt to run a query:
domains = sqlContext.sql("""SELECT domain, COUNT(id) FROM someview GROUP BY domain""")
domains.take(1000).show()
my cluster just crashes and burns - throwing out of memory exceptions or otherwise hanging/crashing/not completing the operation.
I'm guessing that somewhere along the way there's some sort of join that blows one of the executors' memory?
What's the ideal method for performing an operation like this, when the source data is at massive scale and the target data isn't (the list of domains in the above query is relatively short, and should easily fit in memory)
related info available at this question: What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?

I would suggest to tune your executors settings. Especially, setting following parameters correctly can provide dramatic improvement in performance.
spark.executor.instances
spark.executor.memory
spark.yarn.executor.memoryOverhead
spark.executor.cores
In your case, I would also suggest to tune Number of partitions, especially bump up following param from default 200 to higher value, as per requirement.
spark.sql.shuffle.partitions

Related

Spark SQL .distinct() performance

I want to source few hundreds of gigabytes from a database via JDBC and then process it using Spark SQL. Currently I am doing some partitioning at that data and process is by batches of milion records. The thing is that I would like also to apply some deduplication to my dataframes and I was going to leave that idea of separated batches processing and try to process those hundreds of gigabytes using a one dataframe partitioned accordingly.
The main concern is: how will .distinct() work in such case? Will Spark SQL firstly try to load ALL the data into the RAM and then apply deduplication involving many shuffles and repartitioning? Do I have to ensure that a cluster has enough of RAM to contain that raw data or maybe it will be able to help itself with HDD storage (thus killing the performance)?
Or maybe I should do it without Spark - move the data to the target storage and there apply distinct counts and detect duplicates and get rid off them?
Spark SQL does NOT use predicate pushdown for distinct queries; meaning that the processing to filter out duplicate records happens at the executors, rather than at the database. So, your assumption regarding shuffles happening over at the executors to process distinct is correct.
Inspite of this, I would still advise you to go ahead and perform the de-duplication on Spark, rather than build a separate arrangement for it. My personal experience with distinct has been more than satisfactory. It has always been the joins which push my buttons.

How to Let Spark Handle Bigger Data Sets?

I have a very complex query that needs to join 9 or more tables with some 'group by' expressions . Most of these tables have almost the same of numbers of the rows. These tables also have some columns that can be used as the 'key' to partition the tables.
Previously, the app ran fine, but now the data set has 3~4 times data as before. My tests turned out if the row count of each table is less than 4,000,000, the application can still run pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffling and the application stalls (no matter how I adjust the memory, partition, executors, etc.). The actual data probably is just dozens of Gs.
I would think that if the partitioning works properly, Spark shouldn't do shuffle so much and the join should be done on each node. It is puzzling that why Spark is not so 'smart' to do so.
I could split the data set (with the 'key' I mentioned above) into many data sets that these data sets can be dealt with independently. But the burden will be on myself...it discounts the very reason to use Spark. What other approaches that could help?
I use Spark 2.0 over Hadoop YARN.
My tests turned out if the row count of each table is less than 4,000,000, the application can still run pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffling
When joining datasets if the size of one side is less than a certain configurable size, spark broadcasts the entire table to each executor so that join may be performed locally everywhere. Your above observation is consistent with this. You can also provide broadcast hint explicitly to the spark, like so df1.join(broadcast(df2))
Other than that, can you please provide more specifics about your problem?
[Sometime ago I was also grappling with the issue of join and shuffles for one of our jobs that had to handle couple of TBs. We were using RDDs (and not the dataset api). I wrote about my findings [here]1. These may be of some use to you are try to reason about the underlying data shuffle.]
Update: According to documentation -- spark.sql.autoBroadcastJoinThreshold is the configurable property key. 10 MB is its default value. And it does the following:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.
So apparently, this is supported only for the Hive tables.

Understanding "Resources exceeded during query execution" with GROUP EACH BY in BigQuery

I'm writing a background job to automatically process A/B test data in BigQuery, and I'm finding that I'm hitting "Resources exceeded during query execution" when doing large GROUP EACH BY statements. I saw from Resources Exceeded during query execution that reducing the number of groups can make queries succeed, so I split up my data into smaller pieces, but I'm still hitting errors (although less frequently). It would be nice to get a better intuition about what actually causes this error. In particular:
Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
What's the right way to approximate the memory usage and the total memory I have available? Am I correct in assuming each shard tracks about 1/n of the groups and keeps the group key and all aggregates for each group, or is there another way that I should be thinking about it?
How is the number of shards determined? In particular, do I get fewer shards/resources if I'm querying over a smaller dataset?
The problematic query looks like this (in practice, it's used as a subquery, and the outer query aggregates the results):
SELECT
alternative,
snapshot_time,
SUM(column_1),
...
SUM(column_139)
FROM
my_table
CROSS JOIN
[table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time
(Here's an example failed job: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4 )
I realize this query may be asking for trouble, but in this case, the table is only 22MB and the query results in under a million groups and it's still failing with "resources exceeded". Reducing the number of timestamps to process at once fixes the error, but I'm worried that I'll eventually hit a data scale large enough that this approach as a whole will stop working.
As you've guessed, BigQuery chooses a number of parallel workers (shards) for GROUP EACH and JOIN EACH queries based on the size of the tables being operated upon. It is a rough heuristic, but in practice, it works pretty well.
What is interesting about your query is that the GROUP EACH is being done over a larger table than the original table because of the expansion in the CROSS JOIN. Because of this, we choose a number of shards that is too small for your query.
To answer your specific questions:
Resources exceeded almost always means that a worker ran out of memory. This could be a shard or a mixer, in Dremel terms (mixers are the nodes in the computation tree that aggregate results. GROUP EACH BY pushes aggregation down to the shards, which are the leaves of the computation tree).
There isn't a good way to approximate the amount of resources available. This changes over time, with the goal that more of your queries should just work.
The number of shards is determined by the total bytes processed in the query. As you've noticed, this heuristic doesn't work well with joins that expand the underlying data sets. That said, there is active work underway to be smarter about how we pick the number of shards. To give you an idea of scale, your query got scheduled on only 20 shards, which is a tiny fraction of what a larger table would get.
As a workaround, you could save the intermediate result of the CROSS JOIN as a table, and running the GROUP EACH BY over that temporary table. That should let BigQuery use the expanded size when picking the number of shards. (if that doesn't work, please let me know, it is possible that we need to tweak our assignment thresholds).

How do you improve performance on a pig job that has very skewed data?

I am running a pig script that performs a GROUP BY and a nested FOREACH that takes hours to run due to one or two reduce tasks. For example:
B = GROUP A BY (fld1, fld2) parallel 50;
C = FOREACH B {
U = A.fld1;
DIST = DISTINCT U;
GENERATE FLATTEN(group), COUNT_STAR(DIST);
}
Upon examining the counters for the slow tasks, I realized that it looks like the two reducers are processing through a lot more data than the other tasks. Basically, my understanding is that the data is very skewed and so the tasks that are "slow" are in fact doing more work than the fast tasks. I'm just wondering how to improve performance? I hate increasing the parallelism to try to split up the work but is that the only way?
The first option is to use a custom partitioner. Check out the documentation on GROUP for more info (check out PARTITION BY, specifically). Unfortunately, you probably have to write your own custom partitioner here. In your custom partitioner, send the first huge set of keys to reducer 0, send the next set to reducer 1, then do the standard hash partitioning across what's left. What this does is lets one reducer handle the big ones exclusively, while the others get multiple sets of keys. This doesn't always solve the problem with bad skew, though.
How valuable is the count for those two huge sets of data? I see huge skew a lot when things like NULL or empty string. If they aren't that valuable, filter them out before the GROUP BY.

libpq very slow for large (20 million record) database

I am new to SQL/RDBMS.
I have an application which adds rows with 10 columns in PostgreSQL server using the libpq library. Right now, my server is running on same machine as my visual c++ application.
I have added around 15-20 million records. The simple query of getting total count is taking 4-5 minutes using select count(*) from <tableName>;.
I have indexed my table with the time I am entering the data (timecode). Most of the time I need count with different WHERE / AND clauses added.
Is there any way to make things fast? I need to make it as fast as possible because once the server moves to network, things will become much slower.
Thanks
I don't think network latency will be a large factor in how long your query takes. All the processing is being done on the PostgreSQL server.
The PostgreSQL MVCC design means each row in the table - not just the index(es) - must be walked to calculate the count(*) which is an expensive operation. In your case there are a lot of rows involved.
There is a good wiki page on this topic here http://wiki.postgresql.org/wiki/Slow_Counting with suggestions.
Two suggestions from this link, one is to use an index column:
select count(index-col) from ...;
... though this only works under some circumstances.
If you have more than one index see which one has the least cost by using:
EXPLAIN ANALYZE select count(index-col) from ...;
If you can live with an approximate value, another is to use a Postgres specific function for an approximate value like:
select reltuples from pg_class where relname='mytable';
How good this approximation is depends on how often autovacuum is set to run and many other factors; see the comments.
Consider pg_relation_size('tablename') and divide it by the seconds spent in
select count(*) from tablename
That will give the throughput of your disk(s) when doing a full scan of this table. If it's too low, you want to focus on improving that in the first place.
Having a good I/O subsystem and well performing operating system disk cache is crucial for databases.
The default postgres configuration is meant to not consume too much resources to play nice with other applications. Depending on your hardware and the overall utilization of the machine, you may want to adjust several performance parameters way up, like shared_buffers, effective_cache_size or work_mem. See the docs for your specific version and the wiki's performance optimization page.
Also note that the speed of select count(*)-style queries have nothing to do with libpq or the network, since only one resulting row is retrieved. It happens entirely server-side.
You don't state what your data is, but normally the why to handle tables with a very large amount of data is to partition the table. http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
This will not speed up your select count(*) from <tableName>; query, and might even slow it down, but if you are normally only interested in a portion of the data in the table this can be helpful.