Spark SQL queries vs DataFrame functions

To get good performance with Spark, I'm wondering whether it is better to run SQL queries via SQLContext or to use DataFrame functions like df.select().
Any idea? :)

There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.
Arguably, DataFrame queries are much easier to construct programmatically and provide a minimal degree of type safety.
Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without modification in every supported language. With HiveContext, SQL can also expose some functionality that is otherwise inaccessible (for example, UDFs without Spark wrappers).
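For illustration, here is a minimal PySpark sketch of the two styles side by side. It uses the modern SparkSession entry point rather than SQLContext, and the people data and column names are made up:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()

# Made-up data, just to have something to query
people = spark.createDataFrame([("Alice", 34), ("Bob", 45), ("Alice", 29)], ["name", "age"])
people.createOrReplaceTempView("people")

# Plain SQL
sql_result = spark.sql("SELECT name, AVG(age) AS avg_age FROM people GROUP BY name")

# Equivalent DataFrame functions
df_result = people.groupBy("name").agg(F.avg("age").alias("avg_age"))

# Both compile to the same optimized plan, which you can verify with explain()
sql_result.explain()
df_result.explain()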

Ideally, Spark's Catalyst optimizer should compile both calls to the same execution plan, so performance should be the same. Which API you call is just a matter of style.
In practice, there is a difference according to the report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), where SQL outperforms DataFrames in a case where you need grouped records with their total counts, sorted descending by record name.

By using DataFrames, you can break the SQL into multiple statements/queries, which helps with debugging, easy enhancements and code maintenance.
Breaking complex SQL queries into simpler queries and assigning the result to a DataFrame brings better understanding.
By splitting the query into multiple DataFrames, the developer also gains the advantage of caching and repartitioning (to distribute data evenly across the partitions using a unique/close-to-unique key), as in the sketch below.
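A minimal PySpark sketch of that style, using a made-up orders DataFrame and customer_id key; the intermediate result is repartitioned and cached so the later aggregates don't recompute it:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("split-into-dataframes").getOrCreate()

# Made-up input; in practice this would come from a table or file
orders = spark.createDataFrame(
    [(1, "2021-01-01", 10.0), (2, "2021-01-01", 20.0), (1, "2021-01-02", 5.0)],
    ["customer_id", "order_date", "amount"])

# Step 1: filter, repartition on a close-to-unique key, and cache the
# intermediate result so the later steps don't recompute it
recent = (orders
          .filter(F.col("order_date") >= "2021-01-01")
          .repartition("customer_id")
          .cache())

# Step 2: build separate, easier-to-debug aggregates from the cached step
per_customer = recent.groupBy("customer_id").agg(F.sum("amount").alias("total"))
per_day = recent.groupBy("order_date").agg(F.count("*").alias("orders"))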

The only thing that matters is what kind of underlying algorithm is used for grouping.
HashAggregation is more efficient than SortAggregation. SortAggregation sorts the rows and then gathers the matching rows together: O(n*log n).
HashAggregation creates a HashMap keyed on the grouping columns, with the remaining columns as values in the map.
Spark SQL uses HashAggregation where possible (i.e. when the aggregation buffer for the value is mutable): O(n). You can check which strategy a query uses in its physical plan, as in the sketch below.
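A small sketch of inspecting the physical plan with explain(), using made-up data; the exact operator names can vary between Spark versions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("agg-plan").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# A numeric sum has a mutable, fixed-width aggregation buffer, so the plan
# normally shows HashAggregate
df.groupBy("key").agg(F.sum("value")).explain()

# collect_list has an object-based buffer; depending on the Spark version the
# plan shows ObjectHashAggregate or falls back to SortAggregate
df.groupBy("key").agg(F.collect_list("value")).explain()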

Related

Data profiling of columns for big table (SQL Server)

I have a table with over 40 million records. I need to do data profiling, including null counts, distinct values, zeros and blanks, % numeric, % date, whether values need to be trimmed, etc.
The examples I was able to find always implement the task using cursors. For a big table such a solution is a performance killer.
I would be happy to receive suggestions and examples that offer better-performing alternatives. Is it possible to create multiple stored procedures and combine the results in a table? I have not used stored procedures so far, so I base my question only on my understanding of the documentation.
As Gordon mentioned, you should include your table's schema and some sample data to get the best answers, but a couple of things you can look into are as follows:
Columnstore indexes - These can be helpful for analytical querying against a table, e.g. SUM(), COUNT(), COUNT(DISTINCT), etc., because of the compression efficiencies that can be achieved up and down the column for analytics. This is useful if you need a "real time" answer every time you query the data.
You can periodically stage and update the results in a data-warehouse-style table. You basically store the results of those aggregations in their own table and periodically update it, either with a SQL Agent job (not necessarily a real-time solution) or with triggers that automatically update your warehouse table (closer to real time, but potentially performance-heavy if not implemented in a lean manner).
OLAP cubes - This is a more automated version of the above solution with better maintainability, but it is also a more advanced solution. It is a methodology for building out an actual OLAP-based data warehouse.
In terms of implementation difficulty, and given the size of your data (which isn't anything too huge), my recommendation would be to start with columnstore indexes and see how much they help your queries. I've had much success using them for analytical querying. Otherwise, my remaining recommendations are listed in order of difficulty as well.
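As a complement to the above, the cursor-based profiling mentioned in the question can usually be replaced with a single set-based pass over the table, which is exactly the kind of scan a columnstore index speeds up. A rough Python/pandas sketch; the connection string, table name and column name are all hypothetical:

import pandas as pd
import pyodbc

# Hypothetical connection details
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")

# One scan of the table computes several profile metrics for one column
profile = pd.read_sql("""
    SELECT
        COUNT(*)                                            AS total_rows,
        SUM(CASE WHEN SomeColumn IS NULL THEN 1 ELSE 0 END) AS null_count,
        COUNT(DISTINCT SomeColumn)                          AS distinct_values,
        SUM(CASE WHEN SomeColumn = '' THEN 1 ELSE 0 END)    AS blank_count
    FROM dbo.BigTable
""", conn)
print(profile)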

sort_values vs ORDER BY - when and why should I use which?

This is mainly in the scope of Jupyter notebooks and queries in pandas (I'm very new to both). I noticed that when I write a query where I need the dataframe in a certain order, I do:
df = pd.read_sql("select date, count(*) as count from " + tableName + " group by date", conn).sort_values('date', ascending=False)
My friends, who are much more experienced than me, do:
df = pd.read_sql("select date, count(*) as count from " + tableName + " group by date order by date", conn)
The results are the same, but I couldn't get an answer about why/when I would use ORDER BY over sort_values.
Like you said, both achieve the same output. The difference is in where the sorting operation takes place. In the first case, sort_values() is a pandas function that you've chained onto the read_sql() call. This means that your Python process executes the sort after it retrieves the data over the database connection. This is equivalent to doing something like:
df = pd.read_sql("select date, count(*) as count from "+tableName+" group by date" ,conn)
df = df.sort_values(by='date', ascending=False) #sorting done in python environment, not by the database
The second method performs the sorting in the database, so the python environment doesn't sort anything. The key here is to remember that you're basically composing a SQL statement and running it using Python pandas.
Whether you should put the burden of sorting on the database or on the machine running your Python environment depends on the situation. If it's a very busy production database, you might not want to run expensive sorting operations on it, but simply retrieve the data and perform all operations locally using pandas. Alternatively, if the database is for casual or non-critical use, it makes sense to just sort the results in the database before loading the data into pandas.
Update:
To reinforce the notion that SQL-engine-driven (server-side, or db-driven) sorting isn't necessarily always the optimal thing to do, please read this article, which has some interesting profiling stats and common scenarios for when to push data manipulation work to the database vs. when to do it "locally".
I can think of a few reasons here:
Performance
Many, many hours of effort have gone into tuning the code that runs SQL commands. SQL is fast, and I'm willing to bet it is going to be faster to sort with the SQL engine than with pandas.
Maintainability
If, for example, you determine that you do not need the result sorted tomorrow, then you can simply change the query string without having to change your code. This is particularly useful if you are passing the query to some function which runs it for you.
Aesthetics
As a programmer with good design sense, the second method should surely appeal to you: splitting the query logic between SQL and pandas is a recipe for bad design.

Are there any specialized databases for aggregate queries?

Are there any specialized databases - rdbms, nosql, key-value, or anything else - that are optimised for running fast aggregate queries or map-reduces like this over very large data sets:
select date, count(*)
from Sales
where [various combinations of filters]
group by date
So far I've run benchmarks on MongoDB and SQL Server, but I'm wondering if there's a more specialized solution, preferably one that can scale data horizontally.
In my experience, the real issue has less to do with aggregate query performance, which I find to be good in all the major databases I've tried, than with the way the queries are written.
I've lost count of the number of times I've seen enormous report queries with huge amounts of joins and inline subquery aggregates all over the place.
Off the top of my head, the typical steps to make these things faster are:
Use window functions where available and applicable (i.e. the OVER () clause). There's absolutely no point in re-fetching the same data multiple times.
Use common table expressions (WITH queries) where available and applicable (i.e. for sets that you know will be reasonably small).
Use temporary tables for large intermediary results, and create indexes on them (and analyze them) before using them.
Work on small result sets by filtering rows as early as possible: select id, aggregate from (aggregate on id) where id in (?) group by id can be made much faster by rewriting it as select id, aggregate from (aggregate on id where id in (?)) group by id (see the sketch after this list).
Use union/except/intersect all rather than union/except/intersect where applicable; this removes pointless sorting of the result sets.
As a bonus, the first three steps all tend to make the report queries more readable and thus more maintainable.
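To make the early-filtering point concrete, here is a small self-contained Python/sqlite3 sketch (toy table and values invented for illustration) comparing the two query shapes:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (1, 5.0), (3, 7.5)])

# Slower shape: aggregate every user, then throw most of the result away
slow = """
    SELECT user_id, total
    FROM (SELECT user_id, SUM(amount) AS total FROM sales GROUP BY user_id) AS t
    WHERE user_id IN (1, 2)
"""

# Faster shape: filter first so the aggregate only touches the rows you need
fast = """
    SELECT user_id, SUM(amount) AS total
    FROM sales
    WHERE user_id IN (1, 2)
    GROUP BY user_id
"""

print(conn.execute(slow).fetchall())
print(conn.execute(fast).fetchall())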
Oracle, DB2 Warehouse Edition, and to a lesser degree SQL Server Enterprise are all very good at these aggregate queries -- of course these are expensive solutions, and whether they are worth it depends very much on your budget and business case.
Pretty much any OLAP database will do; this is exactly the type of thing they're designed for.
OLAP data cubes are designed for this. You denormalize data into forms that can be computed on quickly. The denormalization and pre-computation steps can take time, so these databases are typically built only for reporting and are kept separate from the real-time transactional data.
For certain kinds of data (large volumes, time series) kx.com provides probably the best solution: kdb+. If it looks like your kind of data, give it a try. Note: they don't use SQL, but rather a more general, more powerful, and more crazy set-theoretical language.

Optimisation techniques for sqlite database

I'm creating an Android application that contains a large amount of data, and it takes a lot of time to access that data. What optimization techniques can I use? Are there any optimized queries?
This page will give you a lot of good tips on how to optimize SQLite things:
http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html
Keep in mind that some stuff SQLite will optimize for you as well when it runs a query:
http://www.sqlite.org/optoverview.html
Use parameterized queries to reduce the number of queries that need to be parsed, see: How do I get around the "'" problem in sqlite and c#?
You can use parametrized statements in Android:
http://developer.android.com/reference/android/database/sqlite/SQLiteDatabase.html#compileStatement%28java.lang.String%29
SQLite does not support dates natively (it stores dates as strings). If you are relying on an index over such date strings, you may get slow query times (not to mention inaccurate or wrong results).
If you really need to sort or filter by date, I'd suggest creating separate columns for the date elements you want to index (such as year, month, and day). Define these columns as integers and add INDEX statements to index their contents, as sketched below.
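The question is about Android, but the idea is easiest to show with a small self-contained Python/sqlite3 sketch (table and column names are made up); the same schema, index and parameterized statements translate directly to SQLiteDatabase and compileStatement() on Android:

import sqlite3

conn = sqlite3.connect(":memory:")

# Store the date parts you filter on as integer columns and index them,
# instead of filtering on a date kept as a free-form string
conn.execute("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        payload TEXT,
        year INTEGER, month INTEGER, day INTEGER
    )
""")
conn.execute("CREATE INDEX idx_events_date ON events(year, month, day)")

# Parameterized statements: the SQL text is prepared once and reused
conn.executemany(
    "INSERT INTO events (payload, year, month, day) VALUES (?, ?, ?, ?)",
    [("a", 2015, 3, 1), ("b", 2015, 3, 2), ("c", 2016, 1, 5)])

rows = conn.execute(
    "SELECT payload FROM events WHERE year = ? AND month = ?", (2015, 3)).fetchall()
print(rows)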

To aggregate or not to aggregate, that is the database schema design question

If you're doing min/max/avg queries, do you prefer to use aggregation tables or simply query across a range of rows in the raw table?
This is obviously a very open-ended question and there's no one right answer, so I'm just looking for people's general suggestions. Assume that the raw data table consists of a timestamp, a numeric foreign key (say a user id), and a decimal value (say a purchase amount). Furthermore, assume that there are millions of rows in the table.
I have done both and am torn. On one hand, aggregation tables have given me significantly faster queries, but at the cost of a proliferation of additional tables. Displaying the current values for an aggregated range either requires dropping back to the raw data table entirely or combining more fine-grained aggregations. I have found that keeping track in the application code of which aggregation table to query when is more work than you'd think, and that schema changes will be required, as the original aggregation ranges will invariably not be enough ("But I wanted to see our sales over the last 3 pay periods!").
On the other hand, querying the raw data can be punishingly slow, but it lets me be very flexible about the date ranges. When the range bounds change, I simply change a query rather than having to rebuild aggregation tables. Likewise, the application code requires fewer updates. I suspect that if I were smarter about my indexing (i.e. always having good covering indexes), I would reduce the penalty of selecting from the raw data, but that's by no means a panacea.
Is there any way I can have the best of both worlds?
We had that same problem and ran into the same issues you ran into. We ended up switching our reporting to Analysis Services. There is a learning curve with MDX and Analysis services itself, but it's been great. Some of the benefits we have found are:
You have a lot of flexibility for querying any way you want. Before, we had to build specific aggregates, but now one cube answers all our questions.
Storage in a cube is far smaller than the detailed data.
Building and processing the cubes takes less time and produces less load on the database servers than the aggregates did.
Some CONS:
There is a learning curve around building cubes and learning MDX.
We had to create some tools to automate working with the cubes.
UPDATE:
Since you're using MySQL, you could take a look at Pentaho Mondrian, which is an open-source OLAP solution that supports MySQL. I've never used it, though, so I don't know whether it will work for you. I'd be interested to hear if it does.
It helps to pick a good primary key (i.e. [user_id, used_date, used_time]). For a constant user_id it's then very fast to do a range condition on used_date.
But as the table grows, you can reduce its size by aggregating into a table like [user_id, used_date]. For every range where the time of day doesn't matter you can then use that table (see the sketch below). Another way to reduce the table size is to archive old data that you no longer allow to be queried.
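As a rough illustration of that aggregation-table idea (sketched with Python/sqlite3 purely to keep it self-contained, and with made-up table and column names; the same SQL shape works in MySQL):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE usage_raw (
        user_id INTEGER, used_date TEXT, used_time TEXT, amount REAL,
        PRIMARY KEY (user_id, used_date, used_time)
    )
""")
conn.executemany("INSERT INTO usage_raw VALUES (?, ?, ?, ?)", [
    (1, "2021-01-01", "09:00", 10.0),
    (1, "2021-01-01", "17:30", 5.0),
    (1, "2021-01-02", "08:15", 7.0)])

# Daily aggregate table, used whenever the time of day doesn't matter
conn.execute("""
    CREATE TABLE usage_daily AS
    SELECT user_id, used_date, SUM(amount) AS total_amount, COUNT(*) AS events
    FROM usage_raw
    GROUP BY user_id, used_date
""")

print(conn.execute(
    "SELECT * FROM usage_daily WHERE user_id = 1 AND used_date >= '2021-01-01'").fetchall())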
I always lean towards raw data. Once aggregated, you can't go back.
It has nothing to do with deletion: unless it's the simplest of aggregated data sets, you can't accurately revert/transpose the data back to raw.
Ideally, I'd use a materialized view (assuming the data can fit within the constraints), because it is effectively a table. But MySQL doesn't support them, so the next consideration would be a view with computed columns, or a trigger to update an actual table.
This is an old question, but for current readers I found this useful, answered by a MicroStrategy engineer.
By the way, there are also ready-made solutions (cube.dev, Dremio), so you don't have to build this yourself.