This is the minimal query statement I want to execute.
select count(*) from temper_300_1 group by onegid;
I do have WHERE clauses to go along with it as well, though. What I am trying to do is build a histogram query and determine the number of elements with a particular onegid. The query takes about 7 seconds on 800 million rows. Could someone suggest a faster alternative or an optimization?
I am actually trying to plot a heatmap from spatial data consisting of latitudes and longitudes. I have assigned a grid id to each element, but the GROUP BY aggregation is turning out to be pretty costly in terms of time.
You're not going to get much faster than group by, though your current query won't display which group item is associated with each count.
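If you do want the group key alongside each count, a minimal variant (same table and column as in the question) would be:
-- one row per grid id together with its element count
select onegid, count(*) as cnt
from temper_300_1
group by onegid;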
Make sure that the table is properly distributed with
select datasliceid, count(1) from temper_300_1 group by datasliceid;
The counts should be roughly equal. If they're not, your DBA needs to redistribute the table on a better distribution key.
If it is, you could ask your DBA to create a materialized view on that specific column, ordered by that column. You may see some performance gains.
I would say that there are two primary considerations for performance related to your query: distribution and row size/extent density.
Distribution:
As #jeremytwfortune mentions, it is important that your data be well distributed with little skew. In an MPP system such as Netezza, you are only as fast as your slowest data slice, and if one data slice has 10x the data as the rest it will likely drag your performance down.
The other distribution consideration is that if your table is not already distributed on onegid, it will be dynamically redistributed on onegid when the query runs in support of your GROUP BY onegid clause. This will happen for GROUP BYs and windowed aggregates with PARTITION BYs. If the distribution of onegid values is not relatively even you may be faced with processing skew.
If your table is already distributed on onegid and you don't supply any other WHERE predicates then you are probably already optimally configured from that standpoint.
Row Size / Extent Density
As Netezza reads data to support your query, each data slice will read its disk in 3 MB extents. If your row is substantially wider than just the onegid value, you will be reading more data from the disk than you need in order to answer your query. If your table is large, your rows are wider than just onegid, and query time performance is paramount, then you might consider creating a materialized view, like so:
CREATE MATERIALIZED VIEW temper_300_1_mv AS select onegid from temper_300_1 ORDER BY onegid;
When you execute your query against temper_300_1 with only onegid in the SELECT clause, the optimizer will refer to the materialized view only, which will be able to pack more rows into a given 3MB extent. This can be a significant performance boost.
The ORDER BY clause in the MVIEW creation statement will also likely increase the effectiveness of compression of the MVIEW, further reducing the number of extents required to hold a given number of rows, and further improving performance.
I have implemented partitioning on my database using DBMS_REDEFINITION. The question in my mind is: when we execute a select * statement on the table, does partitioning affect performance?
Tip: we are selecting the entire data of the table.
What is the difference between select * from Non_Partitioned_table and select * from Partitioned_table?
Partitioning is unlikely to have any effect on a query like select * from table_name;
It won't help performance.
A common misconception about partitioning is that simply dividing the table into chunks improves performance. Unless you are taking advantage of a specific partitioning feature, such as partition pruning or partition-wise joins, it's much better to let Oracle decide how to divide the work. For example, with parallel processing, Oracle can just as easily divide up the work into block range granules instead of partition granules.
It probably won't hurt performance.
There's a little extra overhead with parsing; more metadata to gather and analyze, more optimizer decisions to evaluate. In practice I've never noticed a significant difference just from partitioning. I've definitely seen some unusual behavior with complex partition pruning decisions, but that's not relevant to this simple statement.
There's a little extra data to read; more segments may mean more wasted space. In practice, with defaults and a normal number of partitions, this extra overhead is not a big deal, especially with deferred segment creation, where an empty partition takes up 0 bytes (although that feature only works for partitions starting with Oracle 11.2.0.2).
But things can go very poorly in some extreme examples. For example, if you set a large and uniform extent size, or disable deferred segment creation, or create a table with thousands of partitions each with only a single row. In that case, a simple select statement could take minutes instead of a fraction of a second.
As long as you have no WHERE clause in your query, there is no difference between a partitioned and a non-partitioned table.
Partitioning only helps when your query's conditions involve the column(s) the table is partitioned on.
For example, when you partition a table by date, every query that has a date condition in its WHERE clause can benefit.
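As a sketch (hypothetical table and partition names), a table range-partitioned by date lets the optimizer prune partitions when the WHERE clause filters on the partition key, while a plain select * still reads everything:
-- Hypothetical example: SALES range-partitioned by SALE_DATE.
CREATE TABLE sales (
    sale_id   NUMBER,
    sale_date DATE,
    amount    NUMBER
)
PARTITION BY RANGE (sale_date) (
    PARTITION p2012 VALUES LESS THAN (DATE '2013-01-01'),
    PARTITION p2013 VALUES LESS THAN (DATE '2014-01-01')
);

-- The date predicate lets Oracle prune down to partition p2013 only.
SELECT SUM(amount)
FROM   sales
WHERE  sale_date >= DATE '2013-01-01'
AND    sale_date <  DATE '2014-01-01';

-- A plain SELECT * with no WHERE clause still reads every partition.
SELECT * FROM sales;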
I recently answered this question based on my experience:
Counting rows before proceeding to actual searching
but I'm not 100% satisfied with the answer I gave.
The question is basically this: Can I get a performance improvement by running a COUNT over a particular query before deciding to run the query that brings back the actual rows?
My intuition is this: you will only save the I/O and wire time associated with retrieving the data instead of the count because to count the data, you need to actually find the rows. The possible exception to this is when the query is a simple function of the indexes.
My question then is this: Is this always true? What other exception cases are there? From a pure performance perspective, in what cases would one want to do a COUNT before running the full query?
First, the answer to your question is highly dependent on the database.
I cannot think of a situation when doing a COUNT() before a query will shorten the overall time for both the query and the count().
In general, doing a count will pre-load tables and indexes into the page cache. Assuming the data fits in memory, this will make the subsequent query run faster (although not much faster if you have fast I/O and the database does read-ahead page reading). However, you have just shifted the time frame to the COUNT(), rather than reducing overall time.
To shorten the overall time (including the run time of the COUNT()) would require changing the execution plan. Here are two ways this could theoretically happen:
A database could update statistics as a table is read in, and these statistics, in turn, change the query plan for the main query.
A database could change the execution plan based on whether tables/indexes are already in the page cache.
Although theoretically possible, I am not aware of any database that does either of these.
You could imagine that intermediate results could be stored, but this would violate the dynamic nature of SQL databases. That is, updates/inserts could occur on the tables between the COUNT() and the query. A database engine could not maintain integrity and maintain such intermediate results.
Doing a COUNT() has disadvantages, relative to speeding up the subsequent query. The query plan for the COUNT() might be quite different from the query plan for the main query. Your example with indexes is one case. Another case would be in a columnar database, where different vertical partitions of the data do not need to be read.
Yet another case would be a query such as:
select t.*, r.val
from table t left outer join
     ref r
     on t.refID = r.refID
where refID has a unique index on the ref table. This join can be eliminated for a count, since there are no duplicates and all records in t are used. However, the join is clearly needed for the full query. Once again, whether a SQL optimizer recognizes and acts on this situation is entirely the decision of the writers of the database, but the join could theoretically be optimized away for the COUNT().
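To make that concrete, here is a sketch (using the hypothetical name some_table in place of the placeholder above): because refID is unique in ref and the join is a left outer join, the join can neither add nor remove rows of t, so a COUNT over the join could in principle be rewritten as a COUNT over t alone.
-- Count over the join...
select count(*)
from some_table t left outer join
     ref r
     on t.refID = r.refID;

-- ...is logically equivalent to this, because a left outer join to a
-- unique refID neither duplicates nor drops rows of t:
select count(*)
from some_table t;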
I hope this question is not too obvious...I have already found lots of good information on interpreting execution plans but there is one question I haven't found the answer to.
Is the plan (and more specifically the relative CPU cost) based on the schema only, or also the actual data currently in the database?
I am trying to do some analysis of where indexes are needed in my product's database, but I am working with my own test system, which does not have close to the amount of data a product in the field would have. I am seeing some odd things, like the estimated CPU cost actually going slightly UP after adding an index, and I am wondering if this is because my data set is so small.
I am using SQL Server 2005 and Management Studio to generate the plans.
It will be based on both Schema and Data. The Schema tells it what indexes are available, the Data tells it which is better.
The answer can change in small degrees depending on the DBMS you are using (you have not stated), but they all maintain statistics against indexes to know whether an index will help. If an index breaks 1000 rows into 900 distinct values, it is a good index to use. If an index only results in 3 different values for 1000 rows, it is not really selective so it is not very useful.
SQL Server's optimizer is 100% cost-based. Other RDBMS optimizers are usually a mix of cost-based and rules-based, but SQL Server, for better or worse, is entirely cost driven. A rules-based optimizer would be one that can say, for example, that the order of the tables in the FROM clause determines the driving table in a join. There are no such rules in SQL Server. See SQL Statement Processing:
The SQL Server query optimizer is a cost-based optimizer. Each possible execution plan has an associated cost in terms of the amount of computing resources used. The query optimizer must analyze the possible plans and choose the one with the lowest estimated cost. Some complex SELECT statements have thousands of possible execution plans. In these cases, the query optimizer does not analyze all possible combinations. Instead, it uses complex algorithms to find an execution plan that has a cost reasonably close to the minimum possible cost.
The SQL Server query optimizer does not choose only the execution plan with the lowest resource cost; it chooses the plan that returns results to the user with a reasonable cost in resources and that returns the results the fastest. For example, processing a query in parallel typically uses more resources than processing it serially, but completes the query faster. The SQL Server optimizer will use a parallel execution plan to return results if the load on the server will not be adversely affected.
The query optimizer relies on distribution statistics when it estimates the resource costs of different methods for extracting information from a table or index. Distribution statistics are kept for columns and indexes. They indicate the selectivity of the values in a particular index or column. For example, in a table representing cars, many cars have the same manufacturer, but each car has a unique vehicle identification number (VIN). An index on the VIN is more selective than an index on the manufacturer. If the index statistics are not current, the query optimizer may not make the best choice for the current state of the table. For more information about keeping index statistics current, see Using Statistics to Improve Query Performance.
Both schema and data.
It takes the statistics into account when building a query plan, using them to approximate the number of rows returned by each step in the query (as this can have an effect on the performance of different types of joins, etc).
A good example of this is the fact that it doesn't bother to use indexes on very small tables, as performing a table scan is faster in this situation.
I can't speak for all RDBMS systems, but Postgres specifically uses estimated table sizes as part of its efforts to construct query plans. As an example, if a table has two rows, it may choose a sequential table scan for the portion of the JOIN that uses that table, whereas if it has 10000+ rows, it may choose to use an index or hash scan (if either of those is available). Incidentally, it used to be possible to trigger poor query plans in Postgres by joining VIEWs instead of actual tables, since there were no estimated sizes for VIEWs.
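A quick way to see this (a sketch with made-up table names, against a recent Postgres) is to compare EXPLAIN output before and after a table grows and is analyzed:
-- Hypothetical tables: a tiny lookup table and a larger fact table.
CREATE TABLE color (id int PRIMARY KEY, name text);
CREATE TABLE item  (id int PRIMARY KEY, color_id int);
CREATE INDEX item_color_idx ON item (color_id);

INSERT INTO color VALUES (1, 'red'), (2, 'blue');

-- With almost no rows, the planner will typically favour sequential scans.
EXPLAIN SELECT * FROM item i JOIN color c ON c.id = i.color_id;

-- Load a lot of rows and refresh the estimated sizes/statistics.
INSERT INTO item SELECT g, 1 + (g % 2) FROM generate_series(1, 100000) g;
ANALYZE item;

-- The plan may now switch to an index or hash-based strategy.
EXPLAIN SELECT * FROM item i JOIN color c ON c.id = i.color_id;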
Part of how Postgres constructs its query plans depends on tunable parameters in its configuration file. More information on how Postgres constructs its query plans can be found on the Postgres website.
For SQL Server, there are many factors that contribute to the final execution plan. On a basic level, statistics play a very large role, but they are based on the data and not always on all of the data. Statistics are also not always up to date. When creating or rebuilding an index, the statistics should be based on a FULL / 100% sample of the data; however, the sample rate for automatic statistics refreshes is much lower than 100%, so it is possible to sample a range that is in fact not representative of much of the data. The estimated number of rows for an operation also plays a role, and that estimate can be based on the number of rows in the table or on the statistics for a filtered operation. So out-of-date (or incomplete) statistics can lead the optimizer to choose a less-than-optimal plan, just as a table with only a few rows can cause it to ignore indexes entirely (a scan can be more efficient in that case).
As mentioned in another answer, the more unique (i.e. selective) the data is, the more useful the index will be. But keep in mind that the only column guaranteed to have statistics is the leading (or "left-most" or "first") column of the index. SQL Server can, and does, collect statistics for other columns, even some not in any indexes, but only if the AutoCreateStatistics DB option is set (and it is by default).
Also, the existence of Foreign Keys can help the optimizer when those fields are in a query.
But one area not considered in the question is that of the Query itself. A query, slightly changed but still returning the same results, can have a radically different Execution Plan. It is also possible to invalidate the use of an Index by using:
LIKE '%' + field
or wrapping the field in a function, such as:
WHERE DATEADD(DAY, -1, field) < GETDATE()
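In many cases the predicate can be rewritten so the function is applied to the constant side rather than to the column, which keeps an index on field usable (same hypothetical column as above):
-- Logically equivalent predicate, but sargable: the column is left bare,
-- so an index on field can still be used for a seek.
WHERE field < DATEADD(DAY, 1, GETDATE())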
Now, keep in mind that read operations are (ideally) faster with Indexes but DML operations (INSERT, UPDATE, and DELETE) are slower (taking more CPU and Disk I/O) as the Indexes need to be maintained.
Lastly, the "estimated" CPU, etc. values for cost are not always to be relied upon. A better test is to do:
SET STATISTICS IO ON
-- run your query here
SET STATISTICS IO OFF
and focus on "logical reads". If you reduce Logical Reads then you should be improving performance.
You will, in the end, need a set of data that comes somewhat close to what you have in Production in order to performance tune with regards to both Indexes and the Queries themselves.
Oracle specifics:
The stated cost is actually an estimated execution time, but it is given in a somewhat arcane unit of measure that has to do with estimated time for block reads. It's important to realize that the calculated cost doesn't say much about the runtime anyway, unless each and every estimate made by the optimizer was 100% perfect (which is never the case).
The optimizer uses the schema for a lot of things when deciding what transformations/heuristics can be applied to the query. Some examples of schema things that matter a lot when evaluating xplans:
Foreign key constraints (can be used for table elimination)
Partitioning (exclude entire ranges of data)
Unique constraints (index unique vs range scans for example)
Not null constraints (anti-joins are not available with NOT IN() on nullable columns)
Data types (type conversions, specialized date arithmetics)
Materialized views (for rewriting a query against an aggregate)
Dimension Hierarchies (to determine functional dependencies)
Check constraints (the constraint is injected if it lowers cost)
Index types (b-tree, bitmap, joined, function-based)
Column order in index (a = 1 on an index over {a,b} = range scan, over {b,a} = skip scan or fast full scan; see the sketch after this list)
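A small sketch of that last point, using a hypothetical table emp with columns a and b:
-- Composite index with a as the leading column.
CREATE INDEX emp_ab_idx ON emp (a, b);

-- Leading-column predicate: a straightforward index range scan.
SELECT * FROM emp WHERE a = 1;

-- Trailing-column-only predicate: the optimizer has to fall back to an
-- index skip scan, a fast full scan of the index, or a full table scan.
SELECT * FROM emp WHERE b = 1;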
The core of the estimates comes from using the statistics gathered on actual data (or manually "cooked" statistics). Statistics are gathered for tables, columns, indexes, partitions and probably something else too.
The following information is gathered:
Nr of rows in table/partition
Average row/col length (important for costing full scans, hash joins, sorts, temp tables)
Number of nulls in col (is_president = 'Y' is pretty much unique)
Distinct values in col (last_name is not very unique)
Min/max value in col (helps unbounded range conditions like date > x)
...to help estimate the nr of expected rows/bytes returned when filtering data. This information is used to determine what access paths and join mechanisms are available and suitable given the actual values from the SQL query compared to the statistics.
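In Oracle, these statistics are typically gathered with the DBMS_STATS package. A minimal sketch, with placeholder schema and table names:
BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
        ownname => 'APP_OWNER',   -- placeholder schema
        tabname => 'ORDERS',      -- placeholder table
        cascade => TRUE           -- also gather statistics for the table's indexes
    );
END;
/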
On top of all that, there is also the physical row order, which affects how "good" or attractive an index becomes vs a full table scan. For indexes this is called the "clustering factor" and is a measure of how closely the row order matches the order of the index entries.
I am looking to improve the performance of a query which selects several columns from a table. I was wondering if limiting the number of columns would have any effect on the performance of the query.
Reducing the number of columns would, I think, have only a very limited effect on the speed of the query, but it could have a larger effect on the transfer speed of the data. The less data you select, the less data needs to be transferred over the wire to your application.
I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
CREATE TABLE MyTable (
    ID int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    Name varchar(50) NOT NULL,
    Status tinyint NOT NULL
);
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
Then the second query becomes fast again because the DB engine never needs to read the row.
Limiting the number of columns has no measurable effect on the query. Almost universally, an entire row is fetched to cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE and ORDER BY processing, more columns than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections earlier just to save a little I/O.
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from "whole row fetch", you have to go for a columnar ("Inverted") database.
It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server, and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced so that should be fairly unusual, but it's still possible (especially if, for example, one of the servers handles some other data whose usage might vary independently from the rest).
Yes, if your query can be covered by a non-clustered index it will be faster, since all the data is already in the index and the base table (if you have a heap) or the clustered index does not need to be touched by the query engine.
To demonstrate what tvanfosson has already written, that there is a "transfer" cost, I ran the following two statements against an MSSQL 2000 DB from Query Analyzer.
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both results returned 947 rows but the first one took 5 ms and the second 973 ms.
Also, because the fields are the same, I would not expect indexing to be a factor here.
Why does the use of temp tables with a SELECT statement improve the logical I/O count? Wouldn't it increase the number of hits to the database instead of decreasing it? Is this because the 'problem' is broken down into sections? I'd like to know what's going on behind the scenes.
There's no general answer. It depends on how the temp table is being used.
The temp table may reduce IO by caching rows created after a complex filter/join that are used multiple times later in the batch. This way, the DB can avoid hitting the base tables multiple times when only a subset of the records are needed.
The temp table may increase IO by storing records that are never used later in the query, or by taking up a lot of space in the engine's cache that could have been better used by other data.
Creating a temp table to use all of its contents once is slower than including the temp's query in the main query because the query optimizer can't see past the temp table and it forces a (probably) unnecessary spool of the data instead of allowing it to stream from the source tables.
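As a rough T-SQL sketch of the first case (all table and column names are made up), the expensive filter/join is materialized once and the small result set is reused:
-- Run the expensive join/filter once and keep the small result set.
SELECT o.CustomerID, SUM(o.Amount) AS Total
INTO   #big_customers
FROM   Orders o
JOIN   Customers c ON c.CustomerID = o.CustomerID
WHERE  c.Region = 'EU'
GROUP BY o.CustomerID
HAVING SUM(o.Amount) > 100000;

-- Reuse the cached rows several times without re-reading the base tables.
SELECT * FROM #big_customers WHERE Total > 500000;
SELECT COUNT(*) FROM #big_customers;

DROP TABLE #big_customers;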
I'm going to assume that by temp tables you mean a sub-select in a WHERE clause. (This is referred to as a semijoin operation, and you can usually see that in the text execution plan for your query.)
When the query optimizer encounters a sub-select/temp table, it makes some assumptions about what to do with that data. Essentially, the optimizer will create an execution plan that performs a join against the sub-select's result set, reducing the number of rows that need to be read from the other tables. Since there are fewer rows, the query engine is able to read fewer pages from disk/memory and reduce the amount of I/O required.
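For example (hypothetical tables and columns), a sub-select like this is typically executed as a semijoin, and Orders only has to be probed for rows whose CustomerID survives the inner filter:
-- Only orders belonging to customers in region 'EU' need to be read;
-- the IN sub-select shows up as a semijoin in the plan.
SELECT o.OrderID, o.Amount
FROM   Orders o
WHERE  o.CustomerID IN (SELECT c.CustomerID
                        FROM   Customers c
                        WHERE  c.Region = 'EU');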
AFAIK, at least with MySQL, temp tables are kept in RAM (up to a size limit), making SELECTs against them much faster than anything that hits the disk.
There are a class of problems where building the result in a collection structure on the database side is much preferable to returning the result's parts to the client, roundtripping for each part.
For example: arbitrary depth recursive relationships (boss of)
There's another class of query problems where the data is not and will not be indexed in a manner that makes the query run efficiently. Pulling results into a collection structure, which can be indexed in a custom way, will reduce the logical IO for these queries.
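A sketch of that last idea in T-SQL (made-up names): pull the intermediate result into a temp table and index it exactly the way the follow-up queries will access it:
-- Materialize the poorly-indexed intermediate result once...
SELECT EventID, DeviceID, EventTime
INTO   #events
FROM   RawEvents
WHERE  EventTime >= DATEADD(DAY, -7, GETDATE());

-- ...then index it the way the subsequent lookups will use it.
CREATE CLUSTERED INDEX IX_events_device ON #events (DeviceID, EventTime);

SELECT DeviceID, COUNT(*) AS EventCount
FROM   #events
GROUP BY DeviceID;

DROP TABLE #events;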