Why does Redshift need to do a full table scan to find the max value of the DIST/SORT key? - sql

I'm doing simple tests on Redshift to try and speed up the insertion of data into a Redshift table. One thing I noticed today is that doing something like this
CREATE TABLE a (x int) DISTSTYLE key DISTKEY (x) SORTKEY (x);
INSERT INTO a (x) VALUES (1), (2), (3), (4);
VACUUM a; ANALYZE a;
EXPLAIN SELECT MAX(x) FROM a;
yields
QUERY PLAN
XN Aggregate (cost=0.05..0.05 rows=1 width=4)
-> XN Seq Scan on a (cost=0.00..0.04 rows=4 width=4)
I know this is only 4 rows, but it still shouldn't be doing a full table scan to find the max value of a pre-sorted column. Isn't that metadata included in the work done by ANALYZE?
And just as a sanity check, the EXPLAIN for SELECT x FROM a WHERE x > 3 only scans 2 rows instead of the whole table.
Edit: I inserted 1,000,000 more rows into the table with random values from 1 to 10,000. Did a vacuum and analyze. The query plan still says it has to scan all 1,000,004 rows.

Analyzing query plans on a tiny data set does not yield any practical insight into how the database would perform a query.
The optimizer has thresholds, and when the cost difference between alternative plans is small enough it stops considering them. The idea is that, for simple queries, the time spent searching for the "perfect" execution plan could exceed the total execution time of a less optimal plan.
Redshift was developed from the code base of ParAccel DB. ParAccel has literally hundreds of parameters that can be changed/adjusted to tune the database for different workloads/situations.
Since Redshift is a "managed" offering, it has these settings preset at levels deemed optimal by Amazon engineers given an "expected" workload.
In general, Redshift and ParAccel are not that great for single-slice queries. These queries tend to be run on all slices anyway, even if they are only going to find data in a single slice.
Once a query is executing in a slice, the minimum amount of data read is one block. Depending on block size this can mean hundreds of thousands of rows.
Remember, Redshift does not have indexes. So you are not going to have a simple record lookup that reads a few entries off an index and then goes straight to a single page on disk. It will always read at least an entire block for that table, and it will do that in every slice.
How do you get a data set meaningful enough to evaluate a query plan?
The short answer is that your table needs a "large number" of data blocks per slice.
How many blocks per slice will my table require? The answer depends on several factors:
Number of nodes in your cluster
Type of node in the cluster - Number of slices per node
Data Type - How many bytes each value requires.
The type of compression encoding for the column involved in the query - the optimal encoding depends on the data demographics
So let's start at the top.
Redshift is an MPP database, where processing is spread across multiple nodes. See Redshift's architecture here.
Each node is further subdivided into slices, which are dedicated data partitions with corresponding hardware resources to process queries on that partition of the data.
When a table is created in Redshift and data is inserted, Redshift will allocate a minimum of one block per column per slice.
Here is a simple example:
If you created a cluster with two ds1.8xlarge nodes, you would have 16 slices per node times two nodes for a total of 32 slices.
Let's say we are running a query and the column in the WHERE clause is something like "ITEM_COUNT", an integer. An integer consumes 4 bytes.
Redshift uses a block size of 1MB.
So in this scenario, your ITEM_COUNT column would have available to it a minimum of 32 blocks times a block size of 1MB, which equates to 32MB of storage.
With 32MB of storage and each entry consuming only 4 bytes, you can hold roughly 8 million entries while each slice still needs just a single block.
In this example in the Amazon Redshift documentation they load close to 40 million rows to evaluate and compare different encoding techniques. Read it here.
But wait.....
There is also compression: with a 75% compression rate, even 32 million records would still fit into that single block per slice.
What is the bottom line?
In order to analyze your query plan, you need tables whose columns span several blocks per slice. In the example above, even 32 million rows would still fit in a single block.
This means that, in the configuration above and with all of these assumptions, a table with a single record would most likely get the same query plan as a table with 32 million records, because in both cases the database only needs to read a single block per slice.
If you want to understand how your data is distributed across slices and how many blocks are being used you can use the queries below:
How many rows per slice:
Select trim(name) as table_name, id, slice, sorted_rows, rows
from stv_tbl_perm
where name like '<<your-tablename>>'
order by slice;
How to count how many blocks:
select trim(name) as table_name, col, b.slice, b.num_values, count(b.slice)
from stv_tbl_perm a, stv_blocklist b
where a.id = b.tbl
and a.slice = b.slice
and name like '<<your-tablename>>'
group by 1,2,3,4
order by col, slice;

Related

Redshift performance difference between CTAS and select count

I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many) and all the tables used in the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what I should look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivably involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default distribution style is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst nodes. Since the source tables were DISTSTYLE ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
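If preserving the distribution style, sort key and encodings is the goal, a common workaround is to skip CTAS and pre-create the target table explicitly, then load it with INSERT ... SELECT. The sketch below uses hypothetical column names, types and encodings; adjust them to match the 14 columns actually returned by query A.
-- Hypothetical target DDL; column names, types and encodings are placeholders.
CREATE TABLE tbl (
    call_id    BIGINT  ENCODE zstd,
    call_date  DATE    ENCODE raw,    -- sort key column left uncompressed
    duration   INTEGER ENCODE zstd
)
DISTSTYLE ALL
SORTKEY (call_date);
-- Load the result of query A into the pre-defined table; with encodings
-- declared up front, no automatic compression analysis is needed.
INSERT INTO tbl
SELECT call_id, call_date, duration
FROM   source_table_a;   -- stands in for the body of query A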
Finally, it comes down to the complexity of Query A. It is possible that Redshift was able to optimize the count query by reading very little data from disk, because it realized that very few columns (or perhaps none) needed to be read to produce the count. This really depends upon the contents of that query.
It could simply be that you've got a very complex query that takes a long time to process (that wasn't processed as part of the Count). If the query involves many JOIN and WHERE statements, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE AS writes all the data returned by the query to disk; a count query does not, and that explains the difference. Writing all the rows is a far more expensive operation than just reading a row count.

AWS Redshift column limit?

I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns increases in a table.
What doesn't make any sense here is that Redshift is supposed to be a column-store database, and there shouldn't in theory be an I/O hit from columns that are not selected in a particular where clause.
More specifically, when TableName is 1600 columns, I found that the below query is substantially slower than if TableName were, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'
My three questions are:
What's the deal? Why does Redshift have this limitation if it claims to be a column store?
Any suggestions for working around this limitation? Joining multiple smaller tables seems to eventually approximate the performance of a single table. I haven't tried pivoting the data.
Does anyone have a suggestion for a fast, real-time performance, horizontally scalable column-store database that doesn't have the above limitations? All we're doing is count queries with simple where restrictions against approximately 10M (rows) x 2500 (columns) data.
I can't explain precisely why it slows down so much but I can verify that we've experienced the same thing.
I think part of the issue is that Redshift stores a minimum of 1MB per column per slice. Having a lot of columns creates a lot of disk seek activity and I/O overhead.
1MB blocks are problematic because most of that space will be empty, yet it will still be read off the disk.
Having lots of blocks means that column data will not be located as close together, so Redshift has to do a lot more work to find it.
Also, (just occurred to me) I suspect that Redshift's MVCC controls add a lot of overhead. It tries to ensure you get a consistent read while your query is executing and presumably that requires making a note of all the blocks for tables in your query, even blocks for columns that are not used. Why is an implicit table lock being released prior to end of transaction in RedShift?
FWIW, our columns were virtually all BOOLEAN and we've had very good results from compacting them (bit masking) into INT/BIGINTs and accessing the values using the bit-wise functions. One example table went from 1400 cols (~200GB) to ~60 cols (~25GB) and the query times improved more than 10x (30-40 down to 1-2 secs).
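As an illustration of that bit-packing approach (table, column and flag layout are hypothetical, and it assumes PostgreSQL-style bitwise operators are available in your cluster), dozens of BOOLEAN columns can be collapsed into one BIGINT and tested per row:
-- bit 0 = is_active, bit 1 = is_deleted, bit 2 = is_premium, ... up to 64 flags
CREATE TABLE events_packed (
    event_id BIGINT,
    flags    BIGINT   -- replaces up to 64 former BOOLEAN columns
);
-- Count rows where the "is_premium" flag (bit 2) is set; 4 = 1 << 2.
SELECT COUNT(1)
FROM   events_packed
WHERE  (flags & 4) <> 0;
-- If the & operator is not supported, integer division and modulo do the same test:
-- WHERE (flags / 4) % 2 = 1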

How does a database system come to know how many different values a particular column has?

At the following link
http://www.programmerinterview.com/index.php/database-sql/selectivity-in-sql-databases/
the author has written that since the "SEX" column has only two possible values, its selectivity for 10,000 records would be, according to the formula given, 0.02%.
But my question is: how does a database system come to know that this particular column has this many unique values? Wouldn't the database system have to scan the entire table at least once? Or is there some other way the database system comes to know about those unique values?
First, you are applying the formula wrong. The selectivity for sex (in the example given) would be 50% not 0.02%. That means that each value appears about 50% of the time.
The general way that databases keep track of this is using something called "statistics". These are measures that are kept about all tables and used by the optimizer. Sometimes, the information can also be provided by an index on the column.
Coming back to your actual question: yes, the database periodically scans the table data and saves some statistics (e.g. max value, min value, number of distinct keys, number of rows in the table, etc.) in an internal table. These statistics are used to estimate the basic result size of your query (or other DML operations) in order to evaluate the optimal execution plan. You can manually trigger statistics generation by running the command EXEC DBMS_STATS.GATHER_DATABASE_STATS; or one of the other gathering procedures, and you can also tell Oracle to read only a sample of the data (e.g. 10% of all rows), as sketched below.
Usually the data content does not change drastically, so it does not matter that the numbers are not absolutely exact; they are (usually) sufficient to estimate an execution plan.
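For example, a minimal sketch of gathering statistics for a single table while sampling about 10% of its rows (schema and table names are placeholders):
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'HR',
    tabname          => 'EMPLOYEES',
    estimate_percent => 10          -- sample roughly 10% of the rows
  );
END;
/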
Oracle has many processes related to calculating the number of distinct values (NDV); the main ones are listed below, followed by a query to inspect the result.
Manual Statistics Gathering: Statistics gathering can be triggered manually, through many different procedures in DBMS_STATS.
AUTOTASK: Since 10g Oracle has a default AUTOTASK job, "auto optimizer stats collection". It will only gather statistics if the current stats are stale.
Bulk Load: In 12c statistics can be gathered during a bulk load.
Sample: The NDV can be computed from 100% of the data or can be estimated based on a sample. The sample can be either based on blocks or rows.
One-pass distinct sampling: 11g introduced a new AUTO_SAMPLE_SIZE algorithm. It scans the entire table but only uses one pass. It's much faster to scan the whole table than to have to sort even a small part of it. There are several more in-depth descriptions of the algorithm, such as this one.
Incremental Statistics: For partitioned tables Oracle can store extra information about the NDV, called a synopsis. With this information, if only a single partition is modified, only that one partition needs to be analyzed to generate both partition and global statistics.
Index NDV: Index statistics are created by default when an index is created. Also, the information can be periodically re-gathered from DBMS_STATS.GATHER_INDEX_STATS or the cascade option in other procedures in DBMS_STATS.
Custom Statistics: The NDV can be manually set with DBMS_STATS.SET_* or ASSOCIATE STATISTICS.
Dynamic Sampling: Right before a query is executed, Oracle can automatically sample a small number of blocks from the table to estimate the NDV. This usually only happens when statistics are missing.
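Whichever of these mechanisms ran last, the resulting NDV can be inspected in the data dictionary. A small query, assuming a table named EMPLOYEES:
SELECT column_name,
       num_distinct,     -- the NDV the optimizer will use
       num_nulls,
       sample_size,
       last_analyzed
FROM   user_tab_col_statistics
WHERE  table_name = 'EMPLOYEES';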
The database scans the data set in a table so it can use the most efficient method to retrieve data. It measures the uniqueness of values using the following formula:
Index Selectivity = number of distinct values / the total number of values
The result will be between zero and one. An index selectivity close to zero means that there are very few distinct values. In these cases indexes actually reduce performance, so the database uses sequential scanning instead of seek operations.
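As a concrete check of the formula (table and column names are hypothetical), the selectivity can be computed directly; two distinct values over 10,000 rows gives 2 / 10000 = 0.0002, i.e. 0.02%:
SELECT COUNT(DISTINCT sex) * 1.0 / COUNT(*) AS index_selectivity
FROM   persons;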
For more information on indexes read https://dba.stackexchange.com/questions/42553/index-seek-vs-index-scan

Querying Postgresql with a very large result set

In an application I need to query a Postgres DB where I expect tens or even hundreds of millions of rows in the result set. I might do this query once a day, or even more frequently. The query itself is relatively simple, although may involve a few JOINs.
My question is: How smart is Postgres with respect to avoiding having to seek around the disk for each row of the result set? Given the time required for a hard disk seek, this could be extremely expensive.
If this isn't an issue, how does Postgres avoid it? How does it know how to lay out data on the disk such that it can be streamed out in an efficient manner in response to this query?
When PostgreSQL analyzes your data, one of the statistics calculated, and used by the query planner, is the correlation between the ordering of values in your field or index and the order on disk:
Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is NULL if the column data type does not have a < operator.)
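That correlation statistic can be read straight from the pg_stats view once the table has been analyzed; for example (the table name is a placeholder):
SELECT tablename,
       attname,
       n_distinct,
       correlation   -- near -1 or +1: index scans are cheap; near 0: lots of random I/O
FROM   pg_stats
WHERE  tablename = 'my_big_table';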
The index cost estimation functions also calculate a correlation:
The indexCorrelation should be set to the correlation (ranging between -1.0 and 1.0) between the index order and the table order. This is used to adjust the estimate for the cost of fetching rows from the parent table.
I don't know for sure, but I assume the planner uses these correlation values when deciding whether the required rows can be read at lower cost by a table scan with sequential I/O (possibly joining in with another concurrent scan of the same table) followed by filtering for the required rows, or by an index scan with its resulting seeks.
PostgreSQL doesn't keep tables sorted according to any particular key, but they can periodically be recreated in a particular index order using the CLUSTER command (which will be slow, with a disk seek per row, if the data to cluster has low correlation to the index values order).
PostgreSQL is able to effectively collect a set of disk blocks that need retrieving, then obtain them in physical order to reduce seeking. It does this through Bitmap Scans. Release Notes for 8.1 say:
Bitmap scans are useful even with a single index, as they reduce the amount of random access needed; a bitmap index scan is efficient for retrieving fairly large fractions of the complete table, whereas plain index scans are not.
Edit: I meant to mention the planner cost constants seq_page_cost and random_page_cost, which inform the planner of the relative costs of fetching a disk page as part of a series of sequential fetches vs. fetching a non-sequential disk page.
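Both constants can be inspected and, with care, overridden per session; the value below is purely illustrative (e.g. for storage where random reads are nearly as cheap as sequential ones):
SHOW seq_page_cost;
SHOW random_page_cost;
SET random_page_cost = 1.5;   -- illustrative session-level override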

Different execution plan for similar queries

I am running two very similar update queries, but for a reason unknown to me they are using completely different execution plans. Normally this wouldn't be a problem, but they both update exactly the same number of rows, and one uses an execution plan that is far inferior to the other (4 secs vs 2 mins); when scaled up this is causing me a massive problem.
The only difference between the two queries is one is using the column CLI and the other DLI. These columns are exactly the same datatype, and are both indexed exactly the same, but for the DLI query execution plan, the index is not used.
Any help as to why this is happening is much appreciated.
-- Query 1
UPDATE a
SET DestKey = (
SELECT TOP 1 b.PrefixKey
FROM refPrefixDetail AS b
WHERE a.DLI LIKE b.Prefix + '%'
ORDER BY len(b.Prefix) DESC )
FROM CallData AS a
-- Query 2
UPDATE a
SET DestKey = (
SELECT TOP 1 b.PrefixKey
FROM refPrefixDetail b
WHERE a.CLI LIKE b.Prefix + '%'
ORDER BY len(b.Prefix) DESC )
FROM CallData AS a
Examine the statistics on these two columns of the table (how the data values for the columns are distributed among all the rows). This will probably explain the difference. One of these columns may have a distribution of values that causes the query, during processing, to examine a substantially higher number of rows than the other query requires (the number of rows updated is controlled by the TOP 1 part, remember). If that is the case, the query optimizer may choose not to use the index. Updating statistics will make them more accurate, but if the distribution of values is such that the optimizer chooses not to use the index, then you may be out of luck.
Understanding how indexes work is useful here. An index is a tree structure of nodes, where each node (starting with a root node) contains information that allows the query processor to determine which branch of the tree to go to next, based on the value it is searching for. It is analogous to a binary tree, except that in databases the trees are not binary; at each level there may be more than two branches below each node.
So, to traverse an index from the root to the leaf level, the processor must read the index once for each level in the index hierarchy (if the index is 5 levels deep, for example, it needs to do 5 I/O operations for each record it searches for).
So in this example, if the query needs to examine more than approximately 20% of the records in the table (based on the value distribution of the column you are searching against), the query optimizer will say to itself, "self, finding 20% of the records at five I/Os per record search costs the same number of I/Os as reading the entire table", so it just ignores the index and does a table scan.
There's really no way to avoid this except by adding additional criteria to your query to further restrict the number of records the query must examine to generate its results.
Try updating your statistics. If that does not help try rebuilding your indexes. It is possible that the cardinality of the data in each column is quite different, causing different execution plans to be selected.
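A minimal sketch of both suggestions in T-SQL, using the table names from the queries above (the index names are hypothetical):
-- Refresh statistics on the updated table and on the prefix lookup table
UPDATE STATISTICS CallData WITH FULLSCAN;
UPDATE STATISTICS refPrefixDetail WITH FULLSCAN;
-- If the plans still differ, rebuild the indexes on the two columns
ALTER INDEX IX_CallData_CLI ON CallData REBUILD;
ALTER INDEX IX_CallData_DLI ON CallData REBUILD;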