Are there any specialized databases for aggregate queries? - sql

Are there any specialized databases - rdbms, nosql, key-value, or anything else - that are optimised for running fast aggregate queries or map-reduces like this over very large data sets:
select date, count(*)
from Sales
where [various combinations of filters]
group by date
So far I've run benchmarks on MongoDB and SQL Server, but I'm wondering if there's a more specialized solution, preferably one that can scale data horizontally.

In my experience, the real issue has less to do with aggregate query performance, which I find good in all major databases I've tried, than it has to do with the way queries are written.
I've lost count of the number of times I've seen enormous report queries with huge amounts of joins and inline subquery aggregates all over the place.
Off the top of my head, the typical steps to make these things faster are:
Use window functions where available and applicable (i.e. the over () operator). There's absolutely no point in refetching data multiple times.
Use common table expressions (with queries) where available and applicable (i.e. sets that you know will be reasonably small).
Use temporary tables for large intermediary results, and create indexes on them (and analyze them) before using them.
Work on small result sets by filtering rows earlier when possible: select id, aggregate from (aggregate on id) where id in (?) group by id can made much faster by rewriting it as select id, aggregate from (aggregate on id where id in (?)) group by id.
Use union/except/intersect all rather than union/except/intersect where applicable. This removes pointless sorting of result sets.
As a bonus the first three steps all tend to make the report queries more readable and thus more maintainable.

Oracle, DB2 Warehouse edition, and to a lesser degree SQLServer enterprise are all very good on these aggregate queries -- of course these are expensive solutions and it depends very much on your budget and business case whether its worth it.

Pretty much any OLAP database, this is exactly the type of thing they're designed for.

OLAP data cubes are designed for this. You denormalize data into forms that they can compute on quickly. The denormalization and pre computation steps can take time, so these databases are typically built only for reporting and are separate from the real time transactional data.

For certain kinds of data (large volumes, time series) kx.com provides probably the best solution: kdb+. If it looks like your kind of data, give it a try. Note: they don't use SQL, but rather a more general, more powerful, and more crazy set-theoretical language.

Related

Data profiling of columns for big table (SQL Server)

I have table with over 40 million records. I need to make data profiling, including Nulls count, Distinct Values, Zeros and Blancs, %Numeric, %Date, Needs to be Trimmed, etc.
The examples that I was able to find are always including implementation of the task using cursors. For big table such solution is performance killer.
I would be happy if I receive suggestions and examples which give better performance alternatives. Is it possible to create multiple stored procedures and combine the results in a table? I have not used stored procedures so far, so I base my question only on understanding that I got from documentation.
As Gordon mentioned, you should include your table's schema and some sample data to get the best answers, but a couple things you can look into are as follows:
Columnstore Indexes - These can be helpful for analytical querying against a table, e.g. SUM(), COUNT(), COUNT(DISTINCT) etc. This is because of the efficiencies in compression that can be achieved up and down the column for analytics. This is useful if you need a "real time" based on answer every time you query against the data.
You can periodically stage and update the results in a data warehouse type table. You basically can store the results to those aggregations in it's own table and periodically update it with either a SQL Agent Job (this isn't necessarily a real time solution) or use triggers to automatically update your data warehouse table (which will be closer to a real time solution but can be performance heavy if not implemented in a lean manner).
OLAP Cubes - This is more of an automated way to the above solution and has better maintainability but is also more advanced of a solution. This is a methodology for building out an actual OLAP based data warehouse.
In terms of difficulty of implementation and based on the size of your data (which isn't anything too huge) my recommendation would be to start with columnstore indexes and see how that helps your queries. I've had much success using them for analytical querying. Otherwise my remaining recommendations are in order of difficulty as well.

Slow Queries in SQL

I'm a database noobie when it comes to even moderately large data sets. I have a SQL database (multiple sql databases actually, a SQLite, Postgres, and MySQL database) all containing the same data, dumped from IMDB. I want to benchmark these different databases. The main table I want to query has about 15 million rows. I want a query that crosses two movies, right now my query looks like this
SELECT * from acted_in INNER JOIN actors
ON acted_in.idactors = actors.idactors WHERE
(acted_in.idmovies = %d OR acted_in.idmovies = %d)
the parameters are randomly generated ids. I want to test the relative speed of the databases by running this query multiple times for randomly generated movies and seeing the amount of time it takes on average. My question is, is there any better way to do the same query, I want to join who acted in what with their information from either of the two movies as this will be the core functionality for the project I am working on, right now the speed is abysmal currently the average speed for a single query is
sqlite: 7.160171360969543
postgres: 8.263306670188904
mysql: 13.27652293920517
This is average time per query (sample space of only 100 queries but it is significant enough for now). So can I do any better? The current run time is completely unacceptable for any practical use. I don't think the joining takes a lot of time, by removing it I get nearly the same results so I believe the lookup is what is taking a long time, as I don't gain a significant speed up when I don't join or look up using the OR conditional.
The thing you don't mention here is having any indexes in the databases. Generally, the way you speed up a query (except for terribly written ones, which this is not) is by adding indexes to the things which are used in join or where criteria. This would slow down updates since the indexes need to be updated any time the table is updated, but would speed up selections using those attributes quite substantially. You may wish to consider adding indexes to any attributes you use which are not already primary keys. Be sure to use the same index type in all databases to be fair.
First off, microbenchmarks on databases are pretty noninformative and it's not a good idea to base your decision on them. There are dozens of better criteria to select a db, like reliability, behavior under heavy loads, availability of certain features (eg an extensible language like the PostGIS extension for postgres, partitioning, ...), license (!!), and so on.
Second, if you want to tune your database, or database server, there's a number of things you have to consider. Some important ones:
db's like lots of memory and fast disks, so setup your server with ample quantities of both.
use the query analysis features offered by all major db's (eg the very visual explain feature in pgadmin for postgres) to analyze the behavior of queries that are important for your use case, and adapt the db based on what you learn from these analyses (eg extra or other indexes)
study to understand your db server well, these are pretty sophisticated programs with lots of settings that influence their behavior and performance
make sure you understand the workload your db is subjected to, eg by using a tool like pgfouine for postgres, others exist for other brands of databases.

Sql optimization Techniques

I want to know optimization techniques for databases that has nearly 80,000 records,
list of possibilities for optimizing
i am using for my mobile project in android platform
i use sqlite,i takes lot of time to retreive the data
Thanks
Well, with only 80,000 records and assuming your database is well designed and normalized, just adding indexes on the columns that you frequently use in your WHERE or ORDER BY clauses should be sufficient.
There are other more sophisticated techniques you can use (such as denormalizing certain tables, partitioning, etc.) but those normally only start to come into play when you have millions of records to deal with.
ETA:
I see you updated the question to mention that this is on a mobile platform - that could change things a bit.
Assuming you can't pare down the data set at all, one thing you might be able to do would be to try to partition the database a bit. The idea here is to take your one large table and split it into several smaller identical tables that each hold a subset of the data.
Which of those tables a given row would go into would depend on how you choose to partition it. For example, if you had a "customer_id" field that could range from 0 to 10,000 you might put customers 0 - 2500 in table1, 2,500 - 5,000 in table2, etc. splitting the one large table into 4 smaller ones. You would then have logic in your app that would figure out which table (or tables) to query to retrieve a given record.
You would want to partition your data in such a way that you generally only need to query one of the partitions at a time. Exactly how you would partition the data would depend on what fields you have and how you are using them, but the general idea is the same.
Create indexes
Delete indexes
Normalize
DeNormalize
80k rows isn't many rows these days. Clever index(es) with queries that utlise these indexes will serve you right.
Learn how to display query execution maps, then learn to understand what they mean, then optimize your indices, tables, queries accordingly.
Such a wide topic, which does depend on what you want to optimise for. But the basics:
indexes. A good indexing strategy is important, indexing the right columns that are frequently queried on/ordered by is important. However, the more indexes you add, the slower your INSERTs and UPDATEs will be so there is a trade-off.
maintenance. Keep indexes defragged and statistics up to date
optimised queries. Identify queries that are slow (using profiler/built-in information available from SQL 2005 onwards) and see if they could be written more efficiently (e.g. avoid CURSORs, used set-based operations where possible
parameterisation/SPs. Use parameterised SQL to query the db instead of adhoc SQL with hardcoded search values. This will allow better execution plan caching and reuse.
start with a normalised database schema, and then de-normalise if appropriate to improve performance
80,000 records is not much so I'll stop there (large dbs, with millions of data rows, I'd have suggested partitioning the data)
You really have to be more specific with respect to what you want to do. What is your mix of operations? What is your table structure? The generic advice is to use indices as appropriate but you aren't going to get much help with such a generic question.
Also, 80,000 records is nothing. It is a moderate-sized table and should not make any decent database break a sweat.
First of all, indexes are really a necessity if you want a well-performing database.
Besides that, though, the techniques depend on what you need to optimize for: Size, speed, memory, etc?
One thing that is worth knowing is that using a function in the where statement on the indexed field will cause the index not to be used.
Example (Oracle):
SELECT indexed_text FROM your_table WHERE upper(indexed_text) = 'UPPERCASE TEXT';

To aggregate or not to aggregate, that is the database schema design question

If you're doing min/max/avg queries, do you prefer to use aggregation tables or simply query across a range of rows in the raw table?
This is obviously a very open-ended question and there's no one right answer, so I'm just looking for people's general suggestions. Assume that the raw data table consists of a timestamp, a numeric foreign key (say a user id), and a decimal value (say a purchase amount). Furthermore, assume that there are millions of rows in the table.
I have done both and am torn. On one hand aggregation tables have given me significantly faster queries but at the cost of a proliferation of additional tables. Displaying the current values for an aggregated range either requires dropping entirely back to the raw data table or combining more fine grained aggregations. I have found that keeping track in the application code of which aggregation table to query when is more work that you'd think and that schema changes will be required, as the original aggregation ranges will invariably not be enough ("But I wanted to see our sales over the last 3 pay periods!").
On the other hand, querying from the raw data can be punishingly slow but lets me be very flexible about the data ranges. When the range bounds change, I simply change a query rather than having to rebuild aggregation tables. Likewise the application code requires fewer updates. I suspect that if I was smarter about my indexing (i.e. always having good covering indexes), I would be able to reduce the penalty of selecting from the raw data but that's by no means a panacea.
Is there anyway I can have the best of both worlds?
We had that same problem and ran into the same issues you ran into. We ended up switching our reporting to Analysis Services. There is a learning curve with MDX and Analysis services itself, but it's been great. Some of the benefits we have found are:
You have a lot of flexibility for
querying any way you want. Before we
had to build specific aggregates,
but now one cube answers all our
questions.
Storage in a cube is far smaller
than the detailed data.
Building and processing the cubes
takes less time and produces less
load on the database servers than
the aggregates did.
Some CONS:
There is a learning curve around
building cubes and learning MDX.
We had to create some tools to
automate working with the cubes.
UPDATE:
Since you're using MySql, you could take a look at Pentaho Mondrian, which is an open source OLAP solution that supports MySql. I've never used it though, so I don't know if it will work for you or not. Would be interested in knowing if it works for you though.
It helps to pick a good primary key (ie [user_id, used_date, used_time]). For a constant user_id it's then very fast to do a range-condition on used_date.
But as the table grows, you can reduce your table-size by aggregating to a table like [user_id, used_date]. For every range where the time-of-day doesn't matter you can then use that table. An other way to reduce the table-size is archiving old data that you don't (allow) querying anymore.
I always lean towards raw data. Once aggregated, you can't go back.
Nothing to do with deletion - unless there's the simplest of aggregated data sets, you can't accurately revert/transpose the data back to raw.
Ideally, I'd use a materialized view (assuming that the data can fit within the constraints) because it is effectively a table. But MySQL doesn't support them, so the next consideration would be a view with the computed columns, or a trigger to update an actual table.
Long history question, for currently, I found this useful, answered by microstrategy engineer
BTW, another already have solutions like (cube.dev/dremio) you don't have to do by yourself.

What generic techniques can be applied to optimize SQL queries?

What techniques can be applied effectively to improve the performance of SQL queries? Are there any general rules that apply?
Use primary keys
Avoid select *
Be as specific as you can when building your conditional statements
De-normalisation can often be more efficient
Table variables and temporary tables (where available) will often be better than using a large source table
Partitioned views
Employ indices and constraints
Learn what's really going on under the hood - you should be able to understand the following concepts in detail:
Indexes (not just what they are but actually how they work).
Clustered indexes vs heap allocated tables.
Text and binary lookups and when they can be in-lined.
Fill factor.
How records are ghosted for update/delete.
When page splits happen and why.
Statistics, and how they effect various query speeds.
The query planner, and how it works for your specific database (for instance on some systems "select *" is slow, on modern MS-Sql DBs the planner can handle it).
The biggest thing you can do is to look for table scans in sql server query analyzer (make sure you turn on "show execution plan"). Otherwise there are a myriad of articles at MSDN and elsewhere that will give good advice.
As an aside, when I started learning to optimize queries I ran sql server query profiler against a trace, looked at the generated SQL, and tried to figure out why that was an improvement. Query profiler is far from optimal, but it's a decent start.
There are a couple of things you can look at to optimize your query performance.
Ensure that you just have the minimum of data. Make sure you select only the columns you need. Reduce field sizes to a minimum.
Consider de-normalising your database to reduce joins
Avoid loops (i.e. fetch cursors), stick to set operations.
Implement the query as a stored procedure as this is pre-compiled and will execute faster.
Make sure that you have the correct indexes set up. If your database is used mostly for searching then consider more indexes.
Use the execution plan to see how the processing is done. What you want to avoid is a table scan as this is costly.
Make sure that the Auto Statistics is set to on. SQL needs this to help decide the optimal execution. See Mike Gunderloy's great post for more info. Basics of Statistics in SQL Server 2005
Make sure your indexes are not fragmented. Reducing SQL Server Index Fragmentation
Make sure your tables are not fragmented. How to Detect Table Fragmentation in SQL Server 2000 and 2005
Use a with statment to handle query filtering.
Limit each subquery to the minimum number of rows possible.
then join the subqueries.
WITH
master AS
(
SELECT SSN, FIRST_NAME, LAST_NAME
FROM MASTER_SSN
WHERE STATE = 'PA' AND
GENDER = 'M'
),
taxReturns AS
(
SELECT SSN, RETURN_ID, GROSS_PAY
FROM MASTER_RETURNS
WHERE YEAR < 2003 AND
YEAR > 2000
)
SELECT *
FROM master,
taxReturns
WHERE master.ssn = taxReturns.ssn
A subqueries within a with statement may end up as being the same as inline views,
or automatically generated temp tables. I find in the work I do, retail data, that about 70-80% of the time, there is a performance benefit.
100% of the time, there is a maintenance benefit.
I think using SQL query analyzer would be a good start.
In Oracle you can look at the explain plan to compare variations on your query
Make sure that you have the right indexes on the table. if you frequently use a column as a way to order or limit your dataset an index can make a big difference. I saw in a recent article that select distinct can really slow down a query, especially if you have no index.
The obvious optimization for SELECT queries is ensuring you have indexes on columns used for joins or in WHERE clauses.
Since adding indexes can slow down data writes you do need to monitor performance to ensure you don't kill the DB's write performance, but that's where using a good query analysis tool can help you balanace things accordingly.
Indexes
Statistics
on microsoft stack, Database Engine Tuning Advisor
Some other points (Mine are based on SQL server, since each db backend has it's own implementations they may or may not hold true for all databases):
Avoid correlated subqueries in the select part of a statement, they are essentially cursors.
Design your tables to use the correct datatypes to avoid having to apply functions on them to get the data out. It is far harder to do date math when you store your data as varchar for instance.
If you find that you are frequently doing joins that have functions in them, then you need to think about redesigning your tables.
If your WHERE or JOIN conditions include OR statements (which are slower) you may get better speed using a UNION statement.
UNION ALL is faster than UNION if (And only if) the two statments are mutually exclusive and return the same results either way.
NOT EXISTS is usually faster than NOT IN or using a left join with a WHERE clause of ID = null
In an UPDATE query add a WHERE condition to make sure you are not updating values that are already equal. The difference between updating 10,000,000 records and 4 can be quite significant!
Consider pre-calculating some values if you will be querying them frequently or for large reports. A sum of the values in an order only needs to be done when the order is made or adjusted, rather than when you are summarizing the results of 10,000,000 million orders in a report. Pre-calculations should be done in triggers so that they are always up-to-date is the underlying data changes. And it doesn't have to be just numbers either, we havea calculated field that concatenates names that we use in reports.
Be wary of scalar UDFs, they can be slower than putting the code in line.
Temp table tend to be faster for large data set and table variables faster for small ones. In addition you can index temp tables.
Formatting is usually faster in the user interface than in SQL.
Do not return more data than you actually need.
This one seems obvious but you would not believe how often I end up fixing this. Do not join to tables that you are not using to filter the records or actually calling one of the fields in the select part of the statement. Unnecessary joins can be very expensive.
It is an very bad idea to create views that call other views that call other views. You may find you are joining to the same table 6 times when you only need to once and creating 100,000,00 records in an underlying view in order to get the 6 that are in your final result.
In designing a database, think about reporting not just the user interface to enter data. Data is useless if it is not used, so think about how it will be used after it is in the database and how that data will be maintained or audited. That will often change the design. (This is one reason why it is a poor idea to let an ORM design your tables, it is only thinking about one use case for the data.) The most complex queries affecting the most data are in reporting, so designing changes to help reporting can speed up queries (and simplify them) considerably.
Database-specific implementations of features can be faster than using standard SQL (That's one of the ways they sell their product), so get to know your database features and find out which are faster.
And because it can't be said too often, use indexes correctly, not too many or too few. And make your WHERE clauses sargable (Able to use indexes).