AWS Redshift column limit? - sql

I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns increases in a table.
What doesn't make any sense here is that Redshift is supposed to be a column-store database, so in theory there shouldn't be an I/O hit from columns that are not referenced in a particular query.
More specifically, when TableName has 1600 columns, I found that the query below is substantially slower than if TableName had, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'
My three questions are:
What's the deal? Why does Redshift have this limitation if it claims to be a column store?
Any suggestions for working around this limitation? Joining multiple smaller tables seems to eventually approximate the performance of a single table. I haven't tried pivoting the data.
Does anyone have a suggestion for a fast, real-time performance, horizontally scalable column-store database that doesn't have the above limitations? All we're doing is count queries with simple where restrictions against approximately 10M (rows) x 2500 (columns) data.

I can't explain precisely why it slows down so much but I can verify that we've experienced the same thing.
I think part of the issue is that Redshift stores a minimum of one 1MB block per column per slice. Having a lot of columns creates a lot of disk seek activity and I/O overhead.
1MB blocks are problematic because, for narrow or sparse columns, most of each block is empty space, yet the whole block is still read off the disk.
Having lots of blocks also means that the column data is not located as close together, so Redshift has to do a lot more work to find it.
Also (it just occurred to me), I suspect that Redshift's MVCC controls add a lot of overhead. It tries to ensure you get a consistent read while your query is executing, and presumably that requires making a note of all the blocks for the tables in your query, even blocks for columns that are not used. See: Why is an implicit table lock being released prior to end of transaction in Redshift?
FWIW, our columns were virtually all BOOLEAN and we've had very good results from compacting them (bit masking) into INTs/BIGINTs and accessing the values with the bit-wise functions. One example table went from 1400 columns (~200 GB) to ~60 columns (~25 GB), and the query times improved more than 10x (from 30-40 seconds down to 1-2 seconds).
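For illustration, a minimal sketch of that bit-packing approach (the table and column names here are hypothetical, and it assumes the PostgreSQL-style bit-wise & operator that Redshift inherits):

-- Pack three BOOLEAN columns into a single integer bitmask
-- (a BIGINT can hold up to 64 such flags; all names are made up).
CREATE TABLE packed_table AS
SELECT
    id,
    CAST(  (CASE WHEN bool_col_1 THEN 1 ELSE 0 END)
         + (CASE WHEN bool_col_2 THEN 2 ELSE 0 END)
         + (CASE WHEN bool_col_3 THEN 4 ELSE 0 END) AS BIGINT) AS flags
FROM wide_table;

-- Count rows where the second flag (bit value 2) is set.
SELECT COUNT(1) FROM packed_table WHERE (flags & 2) > 0;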

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, a code smell and arguably a suboptimal design). The table has an 'id' as its PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) in memory and then return only the first_name?
(In other words - the page that contains the row id = 10 if it isn't in the memory already)
I think the answer is yes, unless it has column markers within a row. I understand there might be optimization techniques, but is this the default behavior?
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with 100s of millions of rows with 100 columns each and receive all sorts of SELECT queries on them. The WHERE clause also changes but no incoming request needs all columns. Many of those cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases will load only the requested columns, and compression will help to save space and hopefully improve performance as well.
For MySQL: indexes and data are stored in 16KB "blocks". Each level of the B+Tree holding the PRIMARY KEY (your case) needs to be accessed; for a million rows, that is about 3 levels, hence 3 block reads. Within the leaf block, there are probably dozens of rows, with all their columns (unless a column is "too big", but that is a different discussion).
For MariaDB's ColumnStore: the contents of one column for 64K rows are held in a packed, compressed structure that varies in size and layout. Before getting to that, the clump of 64K rows must be located; after getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromise between speed and space, for both simple and complex queries.
Your simple query is easy and efficient for a regular RDBMS to execute, but messier for a columnstore. Columnstores serve a niche market in which your query is atypical.
Be aware that fetching blocks is typically the slowest part of performing the query, especially when I/O is required. There is a cache of blocks in RAM.
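As an aside (not from the answer above), on MySQL you can see the point-lookup path just described with EXPLAIN, using the EMP table and its PRIMARY KEY on id from the question:

EXPLAIN SELECT first_name FROM EMP WHERE id = 10;
-- For a PRIMARY KEY equality match, expect access type "const": one descent of the
-- B+Tree, reading the leaf block that holds the whole row, then returning first_name.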

Millions of tables on Google BigQuery

I'm using BigQuery for ~5 billion rows that can be partitioned on ~1 million keys.
Since our queries are usually by partition key, is it possible to create ~1 million tables (1 table / key) to limit the total number of bytes processed?
We also need to query all of the data together at times, which is easy to do by putting it all in one table, but I'm hoping to use the same platform for partitioned analysis as well as for bulk analytics.
That might work, but partitioning your table this finely is highly discouraged. You might be better off partitioning your data into a smaller number of tables, say 10 or 100, and querying just the one(s) you need.
What do I mean by discouraged? First, each of those million tables will get charged a minimum of 10 MB for storage, so you'll be charged for 9 TB of storage when you likely have a lot less data than that. Second, you'll likely hit rate limits when you try to create that many tables. Third, managing a million tables is very tricky; the BigQuery UI will likely not be much help. Fourth, you'll make the engineers on the BigQuery team exceedingly grumpy, and they'll start trying to figure out whether we need to raise the minimum size for tables.
Also, if you do want to sometimes query all of your data, partitioning this finely is likely going to make things difficult for you, unless you are willing to store your data multiple times. You can only reference 1000 tables in a query, and each one you reference causes you to take a performance hit.
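As a rough sketch of the "10 or 100 tables" idea (names are hypothetical; this assumes BigQuery standard SQL wildcard tables and shard tables named events_00 through events_99, with each row routed to the shard MOD(partition_key, 100)):

-- Query only the shard that can contain the key, instead of all of the data.
SELECT COUNT(*)
FROM `my_project.my_dataset.events_*`
WHERE _TABLE_SUFFIX = FORMAT('%02d', MOD(1234567, 100))
  AND partition_key = 1234567;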

Oracle performance after removing duplicate records

I have a table in Oracle 11g R2 that contains approximately 90,000 spatial (geographic) records. Hundreds of the records are duplicated due to bad practices by the users.
Is there any way to measure the performance of the database/table before and after removing the duplicates?
A table with 90,000 records is quite a small table, and hundreds of duplicates is less than 1%, which is also quite a small amount of "garbage". That amount can't cause big performance problems (if your application has a good design). I don't think you can create tests that show any significant difference in performance between "before" and "after".
Also, you can delete the duplicates and then create a unique constraint to prevent such a situation in the future.
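A common sketch of that in Oracle (table and key names are hypothetical; the key should be whatever defines a duplicate in your data):

-- Keep one row per key, delete the rest.
DELETE FROM geo_table a
 WHERE a.ROWID > (SELECT MIN(b.ROWID)
                    FROM geo_table b
                   WHERE b.feature_key = a.feature_key);

-- Prevent new duplicates from being inserted.
ALTER TABLE geo_table ADD CONSTRAINT geo_table_uq UNIQUE (feature_key);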
One way to measure the global performance of an Oracle database is via the facilities of Grid Control (aka Enterprise Manager), which shows a number of measurements (CPU, I/O, memory, etc.).
Another way is to run some typical queries in SQL*Plus (with SET TIMING ON) and compare their response times before and after the removal. That is assuming that by "performance" you mean the elapsed time of those queries.
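For example, in SQL*Plus (the query and names below are placeholders for whatever your typical workload looks like):

SET TIMING ON
-- Run the same statement before and after the duplicate removal and compare the elapsed times.
SELECT COUNT(*) FROM geo_table WHERE region_id = 42;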
As Dmitry said, 90,000 rows is a very small table, with a tiny fraction of duplicate rows. The presence or absence of those duplicates is unlikely to make any noticeable difference.
i. Create a temp table from the source table (with the indexes, of course).
ii. Then delete the duplicated rows from the temp table (or from the source, it makes no difference which).
iii. Look at the explain plans for both tables and you will get the answer (see the sketch below).
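A sketch of that procedure in Oracle (names are hypothetical):

-- Make a copy to deduplicate and compare against (recreate the relevant indexes on it too).
CREATE TABLE geo_table_tmp AS SELECT * FROM geo_table;
-- ... delete the duplicate rows from geo_table_tmp here ...

-- Compare the plans for the same query against both tables.
EXPLAIN PLAN FOR SELECT COUNT(*) FROM geo_table WHERE region_id = 42;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
EXPLAIN PLAN FOR SELECT COUNT(*) FROM geo_table_tmp WHERE region_id = 42;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);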

What's the curve for a simple select query?

This is a conceptual question.
Hypothetically, when I do SELECT * FROM table_name and the table has 1 million records, it takes about 3 seconds.
Similarly, when I select 10 million records the time taken is about 30 seconds. But I am told that the time to select records is not linearly proportional to the number of records: after a certain point, does the time required increase exponentially?
Please help me understand how this works.
There are several things that can make one query take longer than another, even for simple selects with no WHERE clauses or joins.
First, the time to return the query depends on how busy the network is at the time the query is run. It could also depend on whether there are any locks on the data or how much memory is available.
It also depends on how wide the tables are and, in general, how many bytes an individual record takes. For instance, I would expect a 10-million-record table with only two INT columns to return much faster than a million-record table with 50 columns, especially if those include large columns such as documents stored as database objects or fields with too much text to fit in an ordinary varchar or nvarchar (in SQL Server these would be nvarchar(max) or text, for instance). I would expect this because there is simply less total data to return, even though there are more records.
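One rough way to see this on SQL Server (the table names here are made up) is to compare how much data each table actually holds, since a full SELECT has to move roughly that much:

EXEC sp_spaceused 'narrow_two_int_table';
EXEC sp_spaceused 'wide_fifty_column_table';
-- Compare the "data" figures in the two result sets; the wide table usually dwarfs the narrow one.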
As you start adding WHERE clauses and joins, there are of course many more things that affect the performance of an individual query. If you query databases, you should read a good book on performance tuning for your particular database. There are many things you can do without realizing it that cause queries to run more slowly than they need to. You should learn the techniques that produce the queries most likely to be performant.
I think this is different for each database server. Try to monitor the performance while you fire your queries (what happens to the memory and the CPU?).
Eventually every hardware component has a bottleneck. If you come close to that point, the server might 'suffocate'.

Large Denormalized Table Optimization

I have a single large denormalized table that mirrors the makeup of a fixed-length flat file that is loaded yearly: 112 columns and 400,000 records. I have a unique clustered index on the 3 columns that make up the WHERE clause of the query that is run most often against this table. Index fragmentation is 0.01. Performance on that query is good, sub-second. However, returning all the records takes almost 2 minutes. The execution plan shows 100% of the cost is on a Clustered Index Scan (not a seek).
There are no queries that require a join (due to the denorm). The table is used for reporting. All fields are type nvarchar (of the length of the field in the data file).
Beyond normalizing the table, what else can I do to improve performance?
Try paginating the query. You can split the results into, let's say, groups of 100 rows. That way, your users will see the results pretty quickly. Also, if they don't need to see all the data every time they view the results, it will greatly cut down the amount of data retrieved.
Beyond this, adding parameters to the query that filter the data will reduce the amount of data returned.
This post is a good way to get started with pagination: SQL Pagination Query with order by
Just replace the "50" and "100" in the answer to use page variables and you're good to go.
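For example, on SQL Server 2012+ one way to page is OFFSET/FETCH (a sketch: @PageNumber, @PageSize, and the names below are placeholders, and the ORDER BY should be your three clustered-key columns):

DECLARE @PageNumber int = 1, @PageSize int = 100;

SELECT *
FROM report_table
ORDER BY key_col_1, key_col_2, key_col_3
OFFSET (@PageNumber - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;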
Here are three ideas. First, if you don't need nvarchar, switch these to varchar. That will halve the storage requirement and should make things go faster.
Second, be sure that the lengths of the fields are less than nvarchar(4000)/varchar(8000). Anything larger causes the values to be stored on a separate page, increasing retrieval time.
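A sketch of those first two ideas (SQL Server; the table and column names are hypothetical). Check the widest value actually stored before shrinking the type:

SELECT MAX(LEN(some_text_col)) AS max_len FROM report_table;

-- If nothing needs Unicode and max_len fits, convert the column.
ALTER TABLE report_table ALTER COLUMN some_text_col varchar(100) NULL;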
Third, you don't say how you are retrieving the data. If you are bringing it back into another tool, such as Excel, or through ODBC, there may be other performance bottlenecks.
In the end, though, you are retrieving a large amount of data, so you should expect the time to be much longer than for retrieving just a handful of rows.
When you ask for all rows, you'll always get a scan.
400,000 rows X 112 columns X 17 bytes per column is 761,600,000 bytes. (I pulled 17 out of thin air.) Taking two minutes to move 3/4 of a gig across the network isn't bad. That's roughly the throughput of my server's scheduled backup to disk.
Do you have money for a faster network?