pyspark using sql queries and doing group by optimisation - sql

In Spark one can write SQL queries as well as use the Spark API functions. reduceByKey should always be preferred over groupByKey because it reduces shuffling.
I would like to know: when you run SQL queries by registering the DataFrame, how can we use reduceByKey? In SQL there is only GROUP BY, no reduce-by. Does Spark internally optimise this to use something like reduceByKey rather than a plain group-by?

I got it. I actually ran an explain to understand the physical plan: it first executes a partial_sum function and only afterwards executes sum, which implies that it performs a sum within each executor first and only then shuffles across executors.
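For reference, a minimal sketch of that check (the session, table and column names are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupby-plan").getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
df.createOrReplaceTempView("t")

# The physical plan shows a HashAggregate with partial_sum (computed inside each
# executor), then an Exchange (the shuffle), then a final HashAggregate with sum.
spark.sql("SELECT key, SUM(value) AS total FROM t GROUP BY key").explain()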

Related

SQL order of execution

I wonder how this query executes successfully. As we know, the HAVING clause executes before the SELECT one, so how can an alias defined in the SELECT statement be used in the HAVING condition without giving any error?
As we know, the HAVING clause executes before the SELECT one
This claim is wrong. The HAVING clause applies filters to aggregate functions (such as SUM, AVG, COUNT, MIN and MAX). Since those aggregates need to be calculated BEFORE any such filter can be applied, the SELECT list has in fact already been evaluated by the time the HAVING clause starts to be processed.
Even if the previous paragraph were not true, it is important to remember that SQL statements are interpreted as a whole before any processing. Because of this, the order of the clauses doesn't really matter: the interpreter can link all references so that they make sense at run time.
So it would be perfectly feasible to put the HAVING clause before the SELECT, or anywhere else in the statement; its position is just a syntax decision. The HAVING clause currently sits after the GROUP BY clause because someone decided that this syntax makes more sense in SQL.
Finally, consider that being allowed to reference something by an alias is much more a language feature than a statement about how the instruction is processed.
The order of execution is:
Getting Data (From, Join)
Row Filter (Where)
Grouping (Group by)
Group Filter (Having)
Return Expressions (Select)
Order & Paging (Order by & Limit / Offset)
I still don't quite get what you are asking. Syntactically your SELECT query is correct, but whether it gives the correct result we cannot know.
The Spark SQL engine is obviously different from a normal SQL engine because it is a distributed SQL engine. The normal SQL order of execution does not apply here: when you execute a query via Spark SQL, the engine converts it into an optimized DAG before it is distributed across your worker nodes. The worker nodes then perform map, shuffle, and reduce tasks before the result is aggregated and returned to the driver node. Read more about the Spark DAG here.
Therefore, there is more than just one round of selecting, filtering, and aggregating happening before any result is returned. You can see this yourself by clicking the Spark job view on the Databricks query result panel and then selecting Associated SQL Query.
So, when it comes to Spark SQL, I recommend referring to the Spark documentation, which clearly states that the HAVING clause can refer to an aggregate function by its alias.
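As a small illustration of that documented behaviour (a sketch; the employees view and its columns are hypothetical and assumed to be registered already):
spark.sql("""
    SELECT dept, COUNT(*) AS cnt     -- alias defined in the SELECT list...
    FROM employees
    GROUP BY dept
    HAVING cnt > 10                  -- ...and referenced in HAVING, which Spark SQL accepts
""").show()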

sort_values vs order by and when and why should I use which

This is mainly in the scope of Jupyter notebooks and queries in pandas (I'm very new to both). I noticed that when I write a query where I need the DataFrame in a certain order, I do:
df = pd.read_sql("select date, count(*) as count from " + tableName + " group by date", conn).sort_values('date', ascending=False)
My friends, who are much more experienced than I am, do:
df = pd.read_sql("select date, count(*) as count from "+tableName+" group by date order by date",conn")
The results are the same but I couldnt get an answer about why/when I would use order by over sort_values
Like you said, both achieve the same output. The difference is in where the sorting operation takes place. In the first case, sort_values() is a pandas function which you've chained onto the read_sql() call. This means that your Python process performs the sort after it retrieves the data from the database connection. This is equivalent to doing something like:
df = pd.read_sql("select date, count(*) as count from " + tableName + " group by date", conn)
df = df.sort_values(by='date', ascending=False)  # sorting done in the Python environment, not by the database
The second method performs the sorting in the database, so the python environment doesn't sort anything. The key here is to remember that you're basically composing a SQL statement and running it using Python pandas.
Whether you should put the burden of sorting on the database or on the machine running the Python environment depends. If it is a very busy production database, you might not want to run expensive sorting operations there but simply retrieve the data and perform all operations locally in pandas. Alternatively, if the database is for casual or non-critical use, then it makes sense to just sort the results in the database before loading the data into pandas.
Update:
To reinforce the notion that SQL engine driven (server side, or db driven) sorting isn't necessarily always the most optimal thing to do, please read this article which has some interesting profiling stats and common scenarios of when to load the db with data manipulation operations vs. when to do it "locally".
I can think of a few reasons here:
Performance
Many, many hours of effort have gone into tuning the code that runs SQL commands. SQL is fast, and I'm willing to bet it is going to be faster to sort with the SQL engine than with pandas.
Maintainability
If, for example, you determine that you do not need the result sorted tomorrow, then you can simply change the query string without having to change your code. This is particularly useful if you are passing the query to some function which runs it for you.
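As a tiny, hypothetical illustration of that point (the helper, table name and connection object are made up):
def load_frame(query, conn):
    # All ordering decisions live in the query string passed in by the caller.
    return pd.read_sql(query, conn)

df = load_frame("select date, count(*) as count from events group by date order by date desc", conn)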
Aesthetics
As a programmer with good design sense, surely the second method should appeal to you. Splitting one piece of query logic across two places, half in SQL and half in pandas, is a recipe for bad design.

Does the number of columns used in a CTE affect the performance of the query?

Does using more columns within a CTE query affect performance? I am currently trying to execute a query with a WITH clause, and it seems that if I use more columns, it takes more time to load the data. Am I correct?
The number of columns defined in a CTE should have no effect on the actual performance of the query (it might affect compile time, which is generally minuscule).
Why? Because SQL Server "embeds" the code for the CTE in the query itself and then optimizes all the code together. Unused columns should be eliminated.
This might be an overgeneralization. There may be some cases where SQL Server doesn't eliminate the work for unused columns, such as extra aggregation functions in an aggregation query or certain subqueries. But in general, what matters is how the CTE is used, not how many columns are defined in it.
You can think of a CTE as a view that is not materialized to disk. A view expands its definition at run time, and the same goes for a CTE.
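Since this thread is otherwise Spark-heavy, here is a hedged sketch of the same idea in Spark SQL; the table and column names are made up, and the point is simply that the CTE is inlined and unused columns are pruned by the optimizer (you can check SQL Server the same way by comparing execution plans):
spark.sql("""
    WITH wide AS (
        SELECT id, col_a, col_b, col_c FROM source_table
    )
    SELECT id FROM wide
""").explain(True)
# The optimized logical plan projects only `id`; col_a, col_b and col_c never
# appear, even though the CTE declares them.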

Is there a way to query a postgresql database using pandas syntax?

Is there some sort of adaptor that allows querying a postgresql database like it was a pandas dataframe?
Update (16th March 2016)
It is possible, but you would need a compiler that evaluates your query and transforms it into SQL clauses.
The fact that SQL is a higher-level language, and that a DBMS interprets SQL clauses with regard to not only the query but also the data and its distribution, makes this really hard to do performantly.
Wes McKinney is trying to do this with the Ibis project and has a nice writeup about some of the challenges.
Previous post
Unfortunately that's not possible, because SQL is a higher-level language than Python.
With pandas you specify what and how you want to do something, whereas with SQL you only specify what you want. The SQL server is then free to decide how to serve your query. When you add an index to a table, the SQL server can then use that index to serve your query faster without you rewriting your query.
If you instructed your database how you want it to execute your query, then you would also need to rewrite your SQL statements if you wanted them to use an index.
That being said, I commonly use the pattern in neurite's answer for analysis, using SQL to perform initial aggregation (and reduce size of data) and then perform other operations in pandas.
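A rough sketch of that pattern (the connection string, driver and table names are placeholders, not a specific recommendation):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/mydb")  # placeholder URL

# Let the database do the heavy lifting: aggregate server-side first...
daily = pd.read_sql("SELECT date, COUNT(*) AS n FROM events GROUP BY date", engine)

# ...then do the remaining, much smaller work in pandas.
daily = daily.sort_values("date")
daily["rolling_mean"] = daily["n"].rolling(7).mean()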
Not sure if this is exactly what you want but you can load postgres tables into pandas and manipulate them from there.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html
http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html
Shamelessly stolen from the pages referenced above:
import pandas
from sqlalchemy import create_engine
engine = create_engine(
    'postgresql+pg8000://scott:tiger@localhost/test',
    isolation_level='READ UNCOMMITTED'
)
df = pandas.read_sql('SELECT * FROM <TABLE>;', con=engine)

Spark sql queries vs dataframe functions

To get good performance with Spark, I'm wondering whether it is better to write SQL queries via SQLContext or to write queries via DataFrame functions like df.select().
Any idea? :)
There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.
Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.
Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modification in every supported language. With HiveContext, they can also be used to expose some functionality which may be inaccessible in other ways (for example, UDFs without Spark wrappers).
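One way to convince yourself of that is to compare the physical plans produced by the two front ends (a sketch; the sales view is hypothetical and assumed to be registered):
df = spark.table("sales")
df.groupBy("region").count().explain()
spark.sql("SELECT region, COUNT(*) FROM sales GROUP BY region").explain()
# Both print essentially the same plan (HashAggregate -> Exchange -> HashAggregate),
# because both paths go through the same Catalyst optimizer.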
Ideally, Spark's Catalyst optimizer should optimize both calls to the same execution plan, so the performance should be the same. How you call it is just a matter of style.
In reality, there is a difference according to a report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), where SQL outperforms DataFrames in a case where you need grouped records with their total counts, sorted descending by record name.
By using the DataFrame API, one can break the SQL into multiple statements/queries, which helps with debugging, incremental enhancements and code maintenance.
Breaking complex SQL queries into simpler queries and assigning the results to DataFrames brings better understanding.
By splitting a query into multiple DataFrames, developers gain the advantage of using cache and repartitioning (to distribute data evenly across the partitions using a unique or close-to-unique key), as sketched below.
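A hedged sketch of that style (all table, column and key names here are invented):
# Instead of one large SQL statement, build the pipeline in named stages.
orders = spark.table("orders")

# Stage 1: filter early, repartition on a close-to-unique key, and cache the
# intermediate result because several downstream queries reuse it.
recent = orders.where("order_date >= '2023-01-01'").repartition("customer_id")
recent.cache()

# Stage 2: small, separately debuggable aggregations built on the cached stage.
orders_per_customer = recent.groupBy("customer_id").count()
revenue_per_day = recent.groupBy("order_date").sum("amount")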
The only thing that matters is what kind of underlying algorithm is used for grouping.
HashAggregation is generally more efficient than SortAggregation. SortAggregation sorts the rows and then gathers the matching rows together: O(n log n).
HashAggregation builds a hash map keyed on the grouping columns, with the remaining columns as the values: O(n).
Spark SQL uses HashAggregation where possible (when the data in the aggregation buffer is mutable).
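A small way to see which strategy the planner picked (a sketch; `spark` is assumed to be an existing session, and the exact operator names vary by Spark version):
from pyspark.sql import functions as F

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# A numeric sum has a small, mutable aggregation buffer, so the plan typically
# shows HashAggregate with partial_sum / sum.
df.groupBy("key").agg(F.sum("value")).explain()

# collect_list's buffer is not a mutable fixed-width type, so depending on the
# Spark version the plan falls back to ObjectHashAggregate or SortAggregate.
df.groupBy("key").agg(F.collect_list("value")).explain()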