Tableau on Spark Sql performance benchmarks - apache-spark-sql

Are there benchmarks for Spark SQL used as the primary reporting warehouse for Tableau? The reporting warehouse is a few TB in size.
How does Spark SQL's performance compare with Redshift, Exasol, and Presto for use with Tableau? What challenges come up?
There are benchmarks of Spark SQL's query performance against many of them here. I am looking for case studies, benchmarks, and any issues faced specifically with Tableau.

Related

Spark SQL vs Hive vs Presto SQL for analytics on top of Parquet file

I have terabytes of data stored in Parquet format for an analytics use case. There are multiple big tables that need to be joined, and the queries are heavy. The system is expected to be highly scalable. I am currently evaluating Spark SQL, Hive, and Presto SQL. In theory, all of them seem to meet the requirements. Could you please shed some light on the differences and on what should be considered for this use case? Tableau will be used for visualization on top of this.

Data processing - BigQuery vs Data Proc+BigQuery

We have large volumes (10 to 400 billion) of raw data in BigQuery tables. We need to process this data and convert it into star schema tables (probably in a different BigQuery dataset), which can then be accessed by AtScale.
We need the pros and cons of the two options below:
1. Write complex SQL within BigQuery that reads data from the source dataset and loads it into the target dataset (used by AtScale).
2. Use PySpark or MapReduce with the BigQuery connectors from Dataproc and then load the data into the BigQuery target dataset.
Our transformations involve joining multiple tables at different granularities, using analytic functions to get the required information, and so on.
At present this logic is implemented in Vertica using multiple temp tables for faster processing, and we want to rewrite it in GCP (BigQuery or Dataproc).
I went with option 1 successfully: BigQuery is very capable of running very complex transformations in SQL, and on top of that you can run them incrementally with time range decorators. Note that it takes a lot of time and resources to move data back and forth to BigQuery. When you run BigQuery SQL, the data never leaves BigQuery in the first place, and you already have all the raw logs there. So as long as your problem can be solved by a series of SQL statements, I believe this is the best way to go.
We moved off our Vertica reporting cluster last year, successfully rewriting the ETL with option 1.
Around a year ago, I wrote a POC comparing DataFlow with a series of BigQuery SQL jobs orchestrated by a potens.io workflow that allows SQL parallelization at scale.
It took me a good month to write the DataFlow version in Java, with 200+ data points, complex transformations, and terrible debugging capability at the time.
It took a week to do the same using a series of SQL statements with potens.io, utilizing Cloud Function for Windowed Tables and parallelization with clustering of transient tables.
I know there have been a bunch of improvements in Cloud DataFlow since then, but at the time DataFlow did fine only at the scale of millions of records and never completed with billions of input records (the main reason being that the shuffle cardinality was a little under billions of records, with each record having 200+ columns). The SQL approach produced all the required aggregations in under 2 hours for a dozen billion records. The ease of debugging and troubleshooting with potens.io helped a lot too.
Both BigQuery and DataProc can handle huge amounts of complex data.
I think that you should consider two points:
Which transformation would you like to do in your data?
Both tools can perform complex transformations, but consider that PySpark gives you the processing capability of a full programming language, while BigQuery gives you SQL transformations and some scripting structures. If SQL and simple scripting structures are enough to handle your problem, BigQuery is an option. If you need complex scripts to transform your data, or if you think you'll need to build extra features involving transformations in the future, PySpark may be a better option. You can find the BigQuery scripting reference here
Pricing
BigQuery and DataProc have different pricing systems. In BigQuery you need to be concerned about how much data your queries process, while in DataProc you need to be concerned about your cluster's size and VM configuration, how long the cluster runs, and some other settings. You can find the pricing reference for BigQuery here and for DataProc here. You can also simulate the pricing in the Google Cloud Platform Pricing Calculator
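To make the two pricing models concrete, here is a rough sketch in plain Python. Every number and name in it (price per TB scanned, worker hourly rate, cluster size, function names) is a hypothetical placeholder, not an actual Google Cloud price. It also ignores storage, flat-rate plans, and other charges, so treat it as a starting point for a POC estimate only.

```python
def on_demand_query_cost(tb_scanned, price_per_tb):
    """Query-based model: you pay for the data your queries process."""
    return tb_scanned * price_per_tb


def cluster_cost(num_workers, hours_running, price_per_worker_hour):
    """Cluster-based model: you pay for VMs for as long as they run."""
    return num_workers * hours_running * price_per_worker_hour


# Hypothetical workload and prices -- replace with figures from the
# pricing pages or the pricing calculator.
daily_tb_scanned = 12.0
assumed_price_per_tb = 5.0
assumed_workers = 10
assumed_hours_per_day = 3.0
assumed_worker_rate = 0.5

query_model_daily = on_demand_query_cost(daily_tb_scanned, assumed_price_per_tb)
cluster_model_daily = cluster_cost(
    assumed_workers, assumed_hours_per_day, assumed_worker_rate
)

print(f"query-based model:   {query_model_daily:.2f} per day")
print(f"cluster-based model: {cluster_model_daily:.2f} per day")
```

Which side comes out cheaper depends entirely on the inputs: the first model scales with bytes scanned, the second with cluster size and runtime.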
I suggest that you create a simple POC for your project in both tools to see which one has the best cost benefit for you.
I hope this information helps.

Apache Cassandra and Spark

I am an experienced RDBMS developer and admin, but I am new to Apache Cassandra and Spark. I learned Cassandra's CQL, and the documentation says that CQL does not support joins and sub-queries because they would be too inefficient given Cassandra's distributed data model.
So I concluded that in a distributed data environment, joins and sub-queries are not supported because they would hurt performance badly.
But then I learned Spark, which also works with distributed data, yet supports all SQL features, including joins and sub-queries, even though Spark is not a database system and thus does not even have indexes. So my question is: how does Spark support joins and sub-queries on distributed data, and does it do so efficiently?
Thanks in advance.
Spark does the "hard work" required to do a join on distributed data. It performs large shuffles to align data on keys before actually performing joins. This basically means that any join requires a very large amount of data movement unless the original data sources are partitioned based on the keys used for joining.
C* does not allow generic joins like this because of the cost involved: it is geared towards OLTP workloads, and requiring a full data shuffle is inherently OLAP.
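The shuffle-then-join idea can be illustrated with a small plain-Python simulation. This is only a sketch of the general technique, not Spark's actual implementation: the function names, the partition count, and the sample data are made up for illustration. Both inputs are routed to partitions by join key, so matching rows end up in the same partition and each partition can be joined locally.

```python
from collections import defaultdict


def shuffle(rows, num_partitions):
    """Route each (key, value) row to a partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def local_hash_join(left_part, right_part):
    """Join two co-located partitions using an in-memory hash table."""
    table = defaultdict(list)
    for key, value in left_part:
        table[key].append(value)
    return [
        (key, left_value, right_value)
        for key, right_value in right_part
        for left_value in table.get(key, [])
    ]


def shuffle_join(left, right, num_partitions=4):
    """Shuffle both sides on the join key, then join partition by partition."""
    left_parts = shuffle(left, num_partitions)
    right_parts = shuffle(right, num_partitions)
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        joined.extend(local_hash_join(left_part, right_part))
    return joined


# Made-up sample data: (customer_id, payload)
orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]
customers = [(1, "alice"), (2, "bob"), (4, "dave")]

result = sorted(shuffle_join(orders, customers))
print(result)
```

In a real cluster the `shuffle` step is the expensive part, because every row may cross the network. If both inputs are already partitioned on the join key, that step can be skipped.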
Apache Spark has the concept of an RDD (Resilient Distributed Dataset), which is created in memory.
It is the fundamental data structure in Spark.
Joins and queries are performed on these RDDs, and because they operate in memory, they are very efficient.
Please go through the docs below to get an idea of Resilient Distributed Datasets:
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds

Join Considerations in Azure SQL Data Warehouse

How should you design your fact and dimension tables to speed up joins on the new Azure SQL Data Warehouse?
Would distributing the large fact tables by hash and replicating the smaller dimension tables help speed up the join or should indexing be the main consideration?
Azure SQL Data Warehouse initially offers two table types - Round Robin and Hash Distributed (see the SQL DW Table docs at https://azure.microsoft.com/documentation/articles/sql-data-warehouse-develop-table-design/).
Generally for dimension tables, you'll choose round robin distribution. For fact tables you'll want to choose HASH based distributed table designs.
Edit: Replicated is now supported too, which could be a useful option for some dimension tables.
Your basic premise of distributing large fact tables by hash and replicating the smaller dimension tables works great in MPP environments like PDW, but as SQL DW doesn't support replicated data (yet - hopefully one day), you'll need to use the Round Robin distribution for that.
If you can minimise data movement, you take a good step towards improving the performance of joins. However, having the data on the right server is only half the battle, and you should consider indexing strategies as well, just as you would in a regular (SMP) SQL Server environment.
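As a rough illustration of why the distribution choice affects data movement, here is a toy simulation in plain Python. It is not how SQL DW is implemented; the node count, table contents, and function names are all made up. It counts fact rows whose matching dimension row is not on the same node, which is a stand-in for the rows that would have to move before a join.

```python
NUM_NODES = 3


def hash_distribute(rows, key_index, num_nodes=NUM_NODES):
    """Place each row on the node chosen by its (integer) key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[row[key_index] % num_nodes].append(row)
    return nodes


def round_robin_distribute(rows, num_nodes=NUM_NODES):
    """Place rows on nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(num_nodes)]
    for position, row in enumerate(rows):
        nodes[position % num_nodes].append(row)
    return nodes


def replicate(rows, num_nodes=NUM_NODES):
    """Keep a full copy of the table on every node."""
    return [list(rows) for _ in range(num_nodes)]


def fact_rows_needing_movement(fact_nodes, dim_nodes):
    """Count fact rows whose dimension key is not available locally."""
    count = 0
    for fact_rows, dim_rows in zip(fact_nodes, dim_nodes):
        local_keys = {product_id for product_id, _name in dim_rows}
        count += sum(
            1 for _sale_id, product_id in fact_rows if product_id not in local_keys
        )
    return count


# Made-up tables: dimension rows are (product_id, name),
# fact rows are (sale_id, product_id).
products = [(pid, f"product-{pid}") for pid in (5, 3, 4, 0, 2, 1)]
sales = [(sale_id, sale_id % 6) for sale_id in range(12)]

fact_by_hash = hash_distribute(sales, key_index=1)

moved_hash = fact_rows_needing_movement(
    fact_by_hash, hash_distribute(products, key_index=0)
)
moved_round_robin = fact_rows_needing_movement(
    fact_by_hash, round_robin_distribute(products)
)
moved_replicated = fact_rows_needing_movement(fact_by_hash, replicate(products))

print(moved_hash, moved_round_robin, moved_replicated)
```

In this toy setup, hashing both tables on the join key and replicating the dimension both leave every join local, while a round-robin dimension leaves many fact rows without a local match.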
Please note that ADW REPLICATE is in public preview, but I think it is still buggy. I changed several small tables to REPLICATE, but when I join to these replicated tables and look at the explain XML plan, I still see data movement steps, which should not be there if the data is replicated on all nodes. To investigate, I ran DBCC PDW_SHOWSPACEUSED on several of the replicated tables; instead of the row count being identical across all nodes, the counts differ, with some nodes having a zero row count. I am no expert by any means, but I believe there is work to be done, and I cannot find any forums, discussions, or feedback pages to report these issues to.

Pros & cons of BigQuery vs. Amazon Redshift [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Comparing Google BigQuery and Amazon Redshift shows that both can meet the same set of requirements and differ mostly in their cost plans. Redshift seems more complex to configure (defining keys and doing optimization work), whereas Google BigQuery perhaps has an issue with joining tables.
Is there a pros & cons list of Google BigQuery vs. Amazon Redshift?
I posted this comparison on reddit. Quickly enough a long term RedShift practitioner came to comment on my statements. Please see https://www.reddit.com/r/bigdata/comments/3jnam1/whats_your_preference_for_running_jobs_in_the_aws/cur518e for the full conversation.
Sizing your cluster:
Redshift will ask you to choose a number of CPUs, RAM, HD, etc. and to turn them on.
BigQuery doesn't care. Use it whenever you want, no provisioning needed.
Hourly costs when doing nothing:
Redshift will ask you to pay per hour of each of these servers running, even when you are doing nothing.
When idle BigQuery only charges you $0.02 per month per GB stored. 2 cents per month per GB, that's it.
Speed of queries:
Redshift performance is limited by the number of CPUs you are paying for.
BigQuery transparently brings in as many resources as needed to run your query in seconds.
Indexing:
Redshift will ask you to index (correction: distribute) your data under certain criteria, and you'll only be able to run fast queries based on this index.
BigQuery has no indexes. Every operation is fast.
Vacuuming:
Redshift requires periodic maintenance and 'vacuum' operations that last hours. You are paying for each of these server hours.
BigQuery does not. Forget about 'vacuuming'.
Data partitioning and distributing:
Redshift requires you to think about how to distribute data within your servers to keep performance up - optimization that works only for certain queries.
BigQuery does not. Just run whatever query you want.
Streaming live data:
Impossible(?) with Redshift.
BigQuery easily handles ingesting up to 100,000 rows per second per table.
Growing your cluster:
If you have more data or more concurrent users, scaling up will be painful with Redshift.
BigQuery will just work.
Multi zone:
You want a multi-zone Redshift for availability and data integrity? Painful.
BigQuery is multi-zoned by default.
To try BigQuery you don't need a credit card or any setup time. Just try it (quick instructions to try BigQuery).
When you are ready to put your own data into BigQuery, just copy your newline-delimited JSON logs to Google Cloud Storage and import them.
See this in depth guide to data warehouse pricing on the cloud:
Understanding Cloud Pricing Part 3.2 - More Data Warehouses
Amazon Redshift is a standard SQL database (based on Postgres) with MPP features that allow it to scale. These features also require you to conform your data model somewhat to get the best performance. It supports a large amount of the SQL standard and most tools that can speak to Postgres can use it unchanged.
BigQuery is not a database, in the sense that it doesn't use standard SQL and doesn't provide JDBC/ODBC connectivity. It's a unique service with its own API and interfaces. It provides limited support for SQL queries, but most users interact with it via custom code (Java, Python, etc.). Some 3rd party tools have added support for BigQuery, but existing tools will not work without modification.
tl;dr - Redshift is better for interacting with existing tools and using complex SQL. BigQuery is better for custom coded interactions and teams who dislike SQL.
UPDATE 2017-04-17 - Here's a much more up to date summary of the cost and speed differences (wrapped in a sales pitch so YMMV). TL;DR - Redshift is usually faster and will be cheaper if you query the data somewhat regularly. http://blog.panoply.io/a-full-comparison-of-redshift-and-bigquery
UPDATE - Since I keep getting down votes on this (🤷‍♂️) here's an up-to-date response to the items in the other answer:
Sizing your cluster:
Redshift allows you to tailor your costs to your usage. If you want the fastest possible queries choose SSD nodes and if you want the lowest possible cost per GB choose HDD nodes. Start small and add nodes whenever you want.
Hourly costs when doing nothing:
Redshift keeps your cluster ready for queries, can respond in milliseconds (result cache) and it provides a simple, predictable monthly bill.
For example, even if some script accidentally runs 10,000 giant queries over the weekend your Redshift bill will not increase at all.
Speed of queries:
Redshift performance is absolutely best in class and gets faster all the time. 3-5x faster in the last 6 months.
Indexing:
Redshift has no indexes. It allows you to define sort keys to optimize performance from fast to insanely fast.
Vacuuming:
Redshift now automatically runs routine maintenance such as ANALYZE and VACUUM DELETE when your cluster has free resource.
Data partitioning and distributing:
Redshift never requires distribution. It allows you to define distribution keys which can make even huge joins very fast.
{Ask competitors about join performance…}
Streaming live data:
Redshift has 2 choices:
Stream real-time data into Redshift using Amazon Kinesis Firehose.
Skip ingestion altogether by querying your real-time data on S3 as soon as it lands (and at high speed) using Redshift Spectrum external tables.
Growing your cluster:
Redshift can elastically resize most clusters in a few minutes.
Multi zone:
Redshift seamlessly replaces any failed hardware and continuously backs up your data, including across regions if desired.