Redshift performance difference between CTAS and select count

Redshift performance difference between CTAS and select count - sql

I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many) and all the tables used on the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what should I look at?

When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivable involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default DISTKEY is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst nodes. Since the source table was DISTKEY ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of Query A. It is possible that Redshift was able to optimize the query by reading very little data from disk because it realized that very few columns of data (or perhaps no columns) were required to read from disk to display the count. This really depends upon the contents of that Query.
It could simply be that you've got a very complex query that takes a long time to process (that wasn't processed as part of the Count). If the query involves many JOIN and WHERE statements, it could be optimized by wise use of DISTKEY and SORTKEY values.

CREATE TABLE writes all data that is returned by the query to disk, count query does not, that explains the difference. Writing all rows is more expensive operation compared to reading row count.

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, code smell and arguably a potentially less optimized design). The table has an 'id' as PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) in memory and then return only the first_name?
(In other words - the page that contains the row id = 10 if it isn't in the memory already)
I think the answer is yes! unless it has column markers within a row. I understand there might be optimization techniques, but is it a default behavior?
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with 100s of millions of rows with 100 columns each and receive all sorts of SELECT queries on them. The WHERE clause also changes but no incoming request needs all columns. Many of those cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases will load only the requested columns. Yes! Compression will help too to save space and hopefully performance as well.

For MySQL: Indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY in your case needs to be accessed. For example a million rows, that is 3 blocks. Within the leaf block, there are probably dozens of rows, with all their columns (unless a column is "too big"; but that is a different discussion).
For MariaDB's Columnstore: The contents of one columns for 64K rows is held in a packed, compressed structure that varies in size and structure. Before getting to that, the clump of 64K rows must be located. After getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromises between speed and space for both simple and complex queries.
Your simple query is easy and efficient to doing a regular RDBMS, but messier to do in a Columnstore. Columnstore is a niche market in which your query is abnormal.
Be aware that fetching blocks are typically the slowest part of performing the query, especially when I/O is required. There is a cache of blocks in RAM.

How to partition 10 billion row SQL tables quickly using AWS?

I have a SQL database of data delivered in a normalized format with several tables that have several billions of rows of data. I have decided to partition the large tables into separate tables by itemId since when I query the data I only care about 1 item at a time. I would end up having 5000+ tables at the end after partitioning the data. The problem is, partitioning the data takes about 25 minutes to build a single table for 1 item.
5000 items x 25 minutes = 86.8 days
It would take over 86 days to fully partition my entire SQL database. My entire database is about 2.5TB.
Is this something I can leverage AWS for to parallelize on an item level? Can I use AWS database migration services to host the database in its current form and then use AWS process to churn through all of the 5000 queries to partition the big tables into 5000 smaller tables with 2M rows each?
If not, is this something I just have to throw more hardware at to make it run faster (CPU or RAM)?
Thanks in advance.

This doesn't seem like a good strategy. For one thing, simple arithmetic is that 10,000,000,000 rows with 5,000 rows per item results in 2,000,000 partitions in the table.
The limit in Redshift (by default) is 1,000,000 partition per table:
Amazon Redshift Spectrum has the following quotas when using the
Athena or AWS Glue data catalog:
A maximum of 10,000 databases per account.
A maximum of 100,000 tables per database.
A maximum of 1,000,000 partitions per table.
A maximum of 10,000,000 partitions per account.
You should re-think your partitioning strategy. Or perhaps your problem is not suitable for Redshift. There may be other database strategies more suitable for your use-case. (This is not the forum for recommending specific software solutions, however.)

Use the itemid as sortkey and distkey. if the table is vacummed properly and you select one itemid this should have good results, where access time is almost as good as a single table. distkey is used to distribute the data between shards, which means each itemid's blocks would be stored together on the same shard making retrieving all of them faster. Having the itemid also be sortkey means that for itemid's with small row numbers that all exist on the same shard, finding the rows within the table's blocks on a shard would be as fast as possible.

Creating a separate table for each item, where every other attribute of the table remains the same, doesn't seem logical. If the data format is the same, then keep the data in the same table unless there is a particular problem to overcome.
If you set the itemId as the SORTKEY on a Redshift table, then Redshift will be able to skip-over the blocks that do not contain a desired value (when using WHERE itemId = 'xxx'). This will be highly efficient.
Admittedly, trying to keep such a large table sorted would probably be too hard to VACUUM. It would still work reasonably well without the SORTKEY since blocks can still be skipped, but not as efficiently because the data for that itemId would be spread over more blocks.

AWS Redshift column limit?

I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns increases in a table.
What doesn't make any sense here is that Redshift is supposed to be a column-store database, and there shouldn't in theory be an I/O hit from columns that are not selected in a particular where clause.
More specifically, when TableName is 1600 columns, I found that the below query is substantially slower than if TableName were, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'
My three questions are:
What's the deal? Why does Redshift have this limitation if it claims to be a column store?
Any suggestions for working around this limitation? Joins of multiple smaller tables seems to eventually approximate the performance of a single table. I haven't tried pivoting the data.
Does anyone have a suggestion for a fast, real-time performance, horizontally scalable column-store database that doesn't have the above limitations? All we're doing is count queries with simple where restrictions against approximately 10M (rows) x 2500 (columns) data.

I can't explain precisely why it slows down so much but I can verify that we've experienced the same thing.
I think part of the issue is that Redshift stores a minimum of 1MB per column per node. Having a lot of columns creates a lot of disk seek activity and I/O overhead.
1MB blocks are problematic because most of that will be empty space but it will still be read off of the disk
Having lots of blocks means that column data will not be located as close together so Redshift has to do a lot more work to find them.
Also, (just occurred to me) I suspect that Redshift's MVCC controls add a lot of overhead. It tries to ensure you get a consistent read while your query is executing and presumably that requires making a note of all the blocks for tables in your query, even blocks for columns that are not used. Why is an implicit table lock being released prior to end of transaction in RedShift?
FWIW, our columns were virtually all BOOLEAN and we've had very good results from compacting them (bit masking) into INT/BIGINTs and accessing the values using the bit-wise functions. One example table went from 1400 cols (~200GB) to ~60 cols (~25GB) and the query times improved more than 10x (30-40 down to 1-2 secs).

Optimizing sql queries for deleting duplicates in Monetdb

I have a problem where I have a market data table with >100,000,000 rows and I need to search and remove duplicates where the symbol and totvol columns match but the serial_no is different.
I have tried the query below on both a single table and also using a copy of the table for reference, but unfortunately it takes up an enormous amount of heap space (>100G and counting, sometimes filling the harddrive to the brim and crashing my database)and time (>30 mins) and brings my server to a crawl (60-95% cpu usage on 32 cores!) which is unacceptable. Is there an efficient way to write this query to optimize the sql execution if i want to execute something like this regularly?
Normally I would partition the table somehow since duplicates for the most part are inserted adjacent or near each other, but since monetdb is a column store database partitioning this way also takes up a lot of heap space. The only helpful thing I have found to reduce the heap is by creating an entirely new table with a subset of the data (i.e. split alphabetically by symbol) which results in smaller column bat files and then running the operation on the small table, is there any way I can keep the large table in tact and write a query which operates on maybe 1,000,000 rows at a time?
The query:
delete from print_11_11
where exists (Select a.serial_no
from print_11_11 as a, print_11_11 as b
where a.symbol=b.symbol
and a.totvol = b.totvol
and a.serial_no>b.serial_no)
Some example data, rows 2 and 3 are duplicates of one another and rows 4-7 are all duplicates = by my critera, note the exseq may be the same or different, it does not matter which exseq value we keep when removing duplicates:
<table border="1"><tr BGCOLOR="#CCCCFF"><th>serial_no</th><th>ttime</th><th>symbol</th><th>vol</th><th>totvol</th><th>exseq</th></tr>
<tr><td>0</td><td>80017</td><td>T</td><td>200</td><td>200</td><td>133813</td></tr>
<tr><td>855</td><td>80017</td><td>T</td><td>42</td><td>242</td><td>133813</td></tr>
<tr><td>867</td><td>80017</td><td>T</td><td>42</td><td>242</td><td>136690</td></tr>
<tr><td>868</td><td>80210</td><td>T</td><td>100</td><td>342</td><td>136690</td></tr>
<tr><td>876</td><td>80211</td><td>T</td><td>100</td><td>442</td><td>136690</td></tr>
<tr><td>877</td><td>80211</td><td>T</td><td>100</td><td>442</td><td>136696</td></tr>
<tr><td>882</td><td>80211</td><td>T</td><td>100</td><td>442</td><td>136737</td></tr>
<tr><td>883</td><td>80213</td><td>T</td><td>2928</td><td>3370</td><td>136737</td></tr>
</table>

Appropriate query and indexes for a logging table in SQL

Assume a table named 'log', there are huge records in it.
The application usually retrieves data by simple SQL:
SELECT *
FROM log
WHERE logLevel=2 AND (creationData BETWEEN ? AND ?)
logLevel and creationData have indexes, but the number of records makes it take longer to retrieve data.
How do we fix this?

Look at your execution plan / "EXPLAIN PLAN" result - if you are retrieving large amounts of data then there is very little that you can do to improve performance - you could try changing your SELECT statement to only include columns you are interested in, however it won't change the number of logical reads that you are doing and so I suspect it will only have a neglible effect on performance.
If you are only retrieving small numbers of records then an index of LogLevel and an index on CreationDate should do the trick.
UPDATE: SQL server is mostly geared around querying small subsets of massive databases (e.g. returning a single customer record out of a database of millions). Its not really geared up for returning truly large data sets. If the amount of data that you are returning is genuinely large then there is only a certain amount that you will be able to do and so I'd have to ask:
What is it that you are actually trying to achieve?
If you are displaying log messages to a user, then they are only going to be interested in a small subset at a time, and so you might also want to look into efficient methods of paging SQL data - if you are only returning even say 500 or so records at a time it should still be very fast.
If you are trying to do some sort of statistical analysis then you might want to replicate your data into a data store more suited to statistical analysis. (Not sure what however, that isn't my area of expertise)

1: Never use Select *
2: make sure your indexes are correct, and your statistics are up-to-date
3: (Optional) If you find you're not looking at log data past a certain time (in my experience, if it happened more than a week ago, I'm probably not going to need the log for it) set up a job to archive that to some back-up, and then remove unused records. That will keep the table size down reducing the amount of time it takes search the table.

Depending on what kinda of SQL database you're using, you might look into Horizaontal Partitioning. Oftentimes, this can be done entirely on the database side of things so you won't need to change your code.

Do you need all columns? First step should be to select only those you actually need to retrieve.
Another aspect is what you do with the data after it arrives to your application (populate a data set/read it sequentially/?).
There can be some potential for improvement on the side of the processing application.
You should answer yourself these questions:
Do you need to hold all the returned data in memory at once? How much memory do you allocate per row on the retrieving side? How much memory do you need at once? Can you reuse some memory?

A couple of things
do you need all the columns, people usually do SELECT * because they are too lazy to list 5 columns of the 15 that the table has.
Get more RAM, themore RAM you have the more data can live in cache which is 1000 times faster than reading from disk

For me there are two things that you can do,
Partition the table horizontally based on the date column
Use the concept of pre-aggregation.
Pre-aggregation:
In preagg you would have a "logs" table, "logs_temp" table, a "logs_summary" table and a "logs_archive" table. The structure of logs and logs_temp table is identical. The flow of application would be in this way, all logs are logged in the logs table, then every hour a cron job runs that does the following things:
a. Copy the data from the logs table to "logs_temp" table and empty the logs table. This can be done using the Shadow Table trick.
b. Aggregate the logs for that particular hour from the logs_temp table
c. Save the aggregated results in the summary table
d. Copy the records from the logs_temp table to the logs_archive table and then empty the logs_temp table.
This way results are pre-aggregated in the summary table.
Whenever you wish to select the result, you would select it from the summary table.
This way the selects are very fast, because the number of records are far less as the data has been pre-aggregated per hour. You could even increase the threshold from an hour to a day. It all depends on your needs.
Now the inserts would be fast too, because the amount of data is not much in the logs table as it holds the data only for the last hour, so index regeneration on inserts would take very less time as compared to very large data-set hence making the inserts fast.
You can read more about Shadow Table trick here
I employed the pre-aggregation method in a news website built on wordpress. I had to develop a plugin for the news website that would show recently popular (popular during the last 3 days) news items, and there are like 100K hits per day, and this pre-aggregation thing has really helped us a lot. The query time came down from more than 2 secs to under a second. I intend on making the plugin publically available soon.

As per other answers, do not use 'select *' unless you really need all the fields.
logLevel and creationData have indexes
You need a single index with both values, what order you put them in will affect performance, but assuming you have a small number of possible loglevel values (and the data is not skewed) you'll get better performance putting creationData first.
Note that optimally an index will reduce the cost of a query to log(N) i.e. it will still get slower as the number of records increases.
C.

I really hope that by creationData you mean creationDate.
First of all, it is not enough to have indexes on logLevel and creationData. If you have 2 separate indexes, Oracle will only be able to use 1.
What you need is a single index on both fields:
CREATE INDEX i_log_1 ON log (creationData, logLevel);
Note that I put creationData first. This way, if you only put that field in the WHERE clause, it will still be able to use the index. (Filtering on just date seems more likely scenario that on just log level).
Then, make sure the table is populated with data (as much data as you will use in production) and refresh the statistics on the table.
If the table is large (at least few hundred thousand rows), use the following code to refresh the statistics:
DECLARE
l_ownname VARCHAR2(255) := 'owner'; -- Owner (schema) of table to analyze
l_tabname VARCHAR2(255) := 'log'; -- Table to analyze
l_estimate_percent NUMBER(3) := 5; -- Percentage of rows to estimate (NULL means compute)
BEGIN
dbms_stats.gather_table_stats (
ownname => l_ownname ,
tabname => l_tabname,
estimate_percent => l_estimate_percent,
method_opt => 'FOR ALL INDEXED COLUMNS',
cascade => TRUE
);
END;
Otherwise, if the table is small, use
ANALYZE TABLE log COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;
Additionally, if the table grows large, you shoud consider to partition it by range on creationDate column. See these links for the details:
Oracle Documentation: Range Partitioning
OraFAQ: Range partitions
How to Create and Manage Partition Tables in Oracle

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas