I just got introduced to Spark-SQL. Although I have previous experience in RDBMS SQL (Oracle, Teradata, SQL Server, etc.), I am looking to expand my knowledge by learning advanced functions/concepts in Spark-SQL.
So I came across the clauses DISTRIBUTE BY & CLUSTER BY during the process. However, I haven't been able to figure out whether these clauses work in Spark SQL, and if they do, how they work.
Hence, can anyone point me in the right direction? It would be great if someone could explain these two clauses with some examples (provided they can be used in Spark-SQL) and also point me to resources for learning advanced functions of Spark-SQL.
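For reference, this is the sort of usage I've seen in Hive-style examples (a made-up sketch; the sales table and its columns are hypothetical):

    -- Hypothetical table: sales(region, amount)

    -- DISTRIBUTE BY: reportedly repartitions rows by the given expression
    -- (rows with the same region land in the same partition), with no sorting
    SELECT region, amount FROM sales DISTRIBUTE BY region;

    -- CLUSTER BY: reportedly shorthand for DISTRIBUTE BY region SORT BY region
    SELECT region, amount FROM sales CLUSTER BY region;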
Thanks.
I am trying to connect Tableau to a SQL view I made in PostgreSQL.
This view returns ~80k rows with 12 fields. On my local PostgreSQL database, it takes 7 seconds to execute. But when I try to create a chart in a worksheet using this view, it takes forever to display anything (more than 2 minutes just to add a field).
This view is complex and involves many joins, coalesces, and case expressions due to business specificities.
Do you have any ideas for improving this?
Thank you very much for your help! :-)
Best,
Max
Tableau's documentation has helpful info on performance optimization:
https://help.tableau.com/current/pro/desktop/en-us/performance_tips.htm
I highly recommend the whitepaper on designing efficient dashboards mentioned on that site - a bit dated, but timeless advice
For starters, learn to use the Performance Recorder in Tableau to find out what tasks are causing delays, and if they involve queries, to capture the SQL that Tableau emits.
With Tableau, and many other client tools, the standard first approach is to see what SQL the client tool generates, then execute that SQL without using the client tool, say just in psql in your case. If you can reproduce the slow query just in SQL, then you are better positioned to either
Optimize your database, say with indexes or by restructuring your schema, OR
Understand why your client tool, Tableau in this case, generated that inefficient query, and reason about what you could do differently in Tableau that would cause it to generate different SQL.
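As a concrete example of the first option, you might paste the captured query into psql and profile it like this (a sketch; the view, table, and column names are placeholders for your own):

    -- See the actual plan and timings PostgreSQL produces for the captured query
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT region, SUM(amount)
    FROM my_business_view            -- placeholder for your view
    WHERE order_date >= DATE '2023-01-01'
    GROUP BY region;

    -- If a sequential scan on a large underlying table dominates the plan,
    -- an index on the filtered column may help:
    CREATE INDEX idx_orders_order_date ON orders (order_date);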
The whitepaper I mentioned should be helpful
Does using 'custom SQL' instead of joins in Tableau increase the performance of extract refresh on the server? Can someone explain it briefly?
The answer to almost every performance question is first, "it depends", and second, test and understand the measurement results. Real results carry more weight than advice from anyone on the Internet (from me or anyone else).
Still, custom SQL is usually not helpful for increasing performance in Tableau, and often hurts. It is usually much better to define your relationships in Tableau and let Tableau then generate optimized SQL for each view -- just as you let a compiler generate optimized machine code.
When you use custom SQL, you prevent Tableau from optimizing the SQL it generates. It has to run the SQL you provide in a subquery.
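To illustrate (a simplified sketch, not Tableau's exact output; the orders table is made up): if your custom SQL is the inner query below, Tableau has to treat it as an opaque derived table, so the database can't always push filters into it or prune unused joins from it:

    -- What Tableau effectively sends (simplified)
    SELECT "t"."region", SUM("t"."amount") AS "sum_amount"
    FROM (
        -- your custom SQL, run as-is
        SELECT region, amount, status
        FROM orders
        WHERE status = 'shipped'
    ) "t"
    GROUP BY "t"."region";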
The best use case for custom SQL in Tableau is to access database specific features, or possibly windowing queries. Most other SQL functionality is available by using the corresponding Tableau features.
If you do have a complex slow custom SQL query that you must use, it is usually a good idea to make an extract so you only pay the performance cost during extract refresh.
So in your case, I'd focus effort on streamlining or eliminating the custom SQL, monitoring the query plan for the generated SQL, and indexing your database to best support that query.
I have recently been introduced to Spark-SQL and trying to wrap my head around it. I am looking to learn best practices, tips and tricks used for optimization of Spark-SQL queries.
Most importantly, I wish to learn about interpreting Spark SQL EXPLAIN plans. I have searched online for books/articles on Spark SQL EXPLAIN but ended up with almost nothing.
Can anyone please help orient me in the right direction?
Due to Spark's architectural differences from traditional RDBMSs, many relational optimization options don't apply to Spark (e.g. leveraging indexes).
I could not find many resources related exclusively to Spark-SQL. I wish to learn the best tips/techniques (e.g. usage of hints, order of tables in join clauses, such as keeping the largest table at the end of the join, etc.) for writing efficient Spark-SQL queries.
Most importantly, any resources on understanding and leveraging Spark-SQL explain plans would be great.
However, please note that I have access only to Spark-SQL, not PySpark.
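For context, this is as far as I've gotten from pure SQL (made-up tables; I believe EXPLAIN FORMATTED and the BROADCAST join hint need Spark 3.0+) — what I'm missing is how to actually read the output:

    -- Sectioned logical + physical plan (Spark 3.0+)
    EXPLAIN FORMATTED
    SELECT /*+ BROADCAST(c) */ c.name, SUM(o.amount)
    FROM orders o
    JOIN customers c ON o.cust_id = c.id
    GROUP BY c.name;

    -- Older form: one dump with parsed/analyzed/optimized logical plans
    EXPLAIN EXTENDED
    SELECT * FROM orders WHERE amount > 100;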
Any help is appreciated.
Thanks
I am doing a paper about query optimization in different DBMSs (specifically Oracle and MySQL), and I am trying to find the core differences between them.
Both use CBO (cost-based optimization) in the same way: parse the query -> generate candidate plans -> pick the best one given statistics about the database.
I'm still researching information on those two engines, but if someone knows how they differ (or don't), it would be appreciated.
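For context, I've been pulling plans from each engine with the standard commands (a sketch; the employees table is made up, and FORMAT=TREE needs MySQL 8.0.16+):

    -- Oracle: populate PLAN_TABLE, then pretty-print it
    EXPLAIN PLAN FOR
    SELECT * FROM employees WHERE dept_id = 10;
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

    -- MySQL: tabular plan, or a tree-shaped one on newer versions
    EXPLAIN SELECT * FROM employees WHERE dept_id = 10;
    EXPLAIN FORMAT=TREE SELECT * FROM employees WHERE dept_id = 10;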
Not a comprehensive answer at all, but I wanted to give you my insight. In short, Oracle has a much more developed SQL optimizer.
For starters, Oracle has many more algorithms to choose from. This means that sometimes Oracle distinguishes between subtle differences and offers, let's say, three algorithms where MySQL (under the same circumstances) only has one to choose from. Therefore, Oracle has better options for particular cases.
Another difference is that MySQL's execution plans are not very readable. I'm not saying they are bad internally, just that EXPLAIN EXTENDED doesn't tell you many specifics. Oracle makes a very clear distinction between access and filter predicates, while in MySQL you don't really know what's going on.
Oracle has many algorithms suitable for parallel processing across multiple servers, while MySQL is limited to multiple threads on the same machine. This can make a difference for highly parallelisable queries that benefit from multiple servers.
Oracle still has an RBO (Rule-Based Optimizer) that can be useful on some occasions. MySQL doesn't. Oracle recommends not using it, but it's still there if you need it.
Oracle offers a myriad of "hints" to the optimizer in the form of comments (/*+ ... */) where you can tweak the execution plan to suit your needs. MySQL has fewer "clauses" for this.
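For example (a sketch with made-up tables; worth noting that MySQL 8.0 later added its own /*+ ... */ optimizer hints, though still a smaller set than Oracle's):

    -- Oracle: ask for a nested-loop join via a hint comment
    SELECT /*+ USE_NL(e d) */ e.name, d.dept_name
    FROM employees e
    JOIN departments d ON e.dept_id = d.id;

    -- MySQL: the classic way to pin the join order
    SELECT e.name, d.dept_name
    FROM employees e STRAIGHT_JOIN departments d ON e.dept_id = d.id;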
Looks like my data warehouse project is moving to Teradata next year (from SQL Server 2005).
I'm looking for resources about best practices on Teradata - from limitations of its SQL dialect to idioms and conventions for getting queries to perform well - particularly if they highlight things which are significantly different from SQL Server 2005. Specifically tips similar to those found in The Art of SQL (which is more Oracle-focused).
My business processes are currently in T-SQL stored procedures and rely fairly heavily on SQL Server 2005 features like PIVOT, UNPIVOT, and Common Table Expressions to produce about 27m rows of output a month from a 4TB data warehouse.
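For instance, I'm assuming a T-SQL PIVOT like the one below will need a portable CASE-based rewrite on Teradata (simplified, made-up columns):

    -- T-SQL (SQL Server 2005)
    SELECT region, [Jan], [Feb]
    FROM (SELECT region, month_name, amount FROM sales) AS s
    PIVOT (SUM(amount) FOR month_name IN ([Jan], [Feb])) AS p;

    -- Portable rewrite that should run on Teradata
    SELECT region,
           SUM(CASE WHEN month_name = 'Jan' THEN amount END) AS jan_amount,
           SUM(CASE WHEN month_name = 'Feb' THEN amount END) AS feb_amount
    FROM sales
    GROUP BY region;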
One place to start is here: http://www.teradataforum.com/
This might be a little late, but there are a few things I have learned about Teradata that I can warn you about.
Use the most recent version whenever possible.
For V12 the optimizer was re-written and the database performs much better now.
Try to realize that SQL Server and Teradata are very different beasts, most of the concepts will not transition well.
Do not underestimate the importance of a primary index; see the DDL sketch after this list.
The locks that Teradata uses are very primitive when compared to other databases.
Do NOT use TERA mode. Since you do not have any legacy code, use ANSI mode; it is far superior and widely encouraged.
Join indexes are very helpful tools, but they do not provide all the answers.
Parallelism: take the time to understand how FASTLOAD, MULTILOAD, and TPUMP work, and find out how you can leverage them in your ETL strategy.
If you are attempting to run a query which needs to be performant, do not use any casts; the optimizer will not use statistics to generate the best execution plan.
Working with dates is going to be a pain, just a warning.
Teradata is very DDL oriented; try to understand all the syntax related to creating a table.
Compression is a wonderful tool; if you have any values which are repeated in a table, make use of it (again, see the sketch below).
There are not many tools available with Teradata, be prepared to build a lot. The tools that exist are very expensive.
Unfortunately, I do not know much about SQL Server, so I cannot say what tools in SQL Server appear in Teradata.
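To tie the primary-index, compression, and cast points together, here is a minimal sketch (the table, columns, and compressed values are all made up):

    CREATE TABLE sales_fact
    (
        sale_id   INTEGER NOT NULL,
        region_cd CHAR(2) COMPRESS ('NE', 'SE', 'MW', 'WE'),  -- compress the repeated codes
        sale_dt   DATE FORMAT 'YYYY-MM-DD',
        amount    DECIMAL(12,2)
    )
    PRIMARY INDEX (sale_id);  -- drives how rows are distributed across AMPs

    -- Per the tip on casts: compare in the column's native type so that
    -- collected statistics can be used. Prefer
    --     WHERE sale_dt = DATE '2009-01-15'
    -- over
    --     WHERE CAST(sale_dt AS CHAR(10)) = '2009-01-15'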
Hope this helps
I would also look into the recently launched Teradata Developer Exchange as well as the TeradataForum and forums on Teradata's main website.
I don't know of any good references available online. Teradata has some design manuals that are available for download, but they're more instruction manuals than "best practices" as such. Check them out here: http://www.info.teradata.com/DataWarehouse/eTeradata-BrowseBy.cfm?page=Teradata%20Database
Alternatively, you need to find a friendly Teradata expert to bounce ideas off. Try Teradata themselves, or find a local consultant with Teradata experience.
Best practices on Teradata isn't a topic that gets much discussion, and most of the best tricks tend to be proprietary knowledge of the person/people who discovered them.
Sorry,
David Stewardson
Satyam Computer Services
Top of the list on a Google search for "Teradata Best Practices" gave me TERADATA ADVISORY GROUP SETS BEST PRACTICES FOR BUSINESS OBJECTS AND TERADATA CUSTOMERS
EDIT: Seeing as that's just advertising, as you've pointed out, see how you go with these. Please bear in mind that I don't have a clue what Teradata is and can't see myself using it any time this side of the 22nd century AD.
Teradata Discussion Forums
Best Practices for Teradata Deployments
Best Study Guides For NCR Teradata Certifications
The middle one looks promising with its nice long link tree at the top:
Oracle® Business Intelligence Applications Installation and Configuration Guide > Preinstallation and Predeployment Considerations for Oracle BI Applications > Teradata-Specific Database Guidelines for Oracle Business Analytics Warehouse >
and the first link, to the forums, should put you in touch with the right people.