Large Denormalized Table Optimization - sql

I have a single large denormalized table that mirrors the makeup of a fixed-length flat file that is loaded yearly. It has 112 columns and 400,000 records. I have a unique clustered index on the 3 columns that make up the WHERE clause of the query that is run most often against this table. Index fragmentation is .01. Performance on that query is good, sub-second. However, returning all the records takes almost 2 minutes. The execution plan shows 100% of the cost is on a Clustered Index Scan (not a seek).
There are no queries that require a join (due to the denorm). The table is used for reporting. All fields are type nvarchar (of the length of the field in the data file).
Beyond normalizing the table, what else can I do to improve performance?

Try paginating the query. You can split the results into, let's say, groups of 100 rows. That way, your users will see the results pretty quickly. Also, if they don't need to see all the data every time they view the results, it will greatly cut down the amount of data retrieved.
Beyond this, adding parameters to the query that filter the data will reduce the amount of data returned.
This post is a good way to get started with pagination: SQL Pagination Query with order by
Just replace the "50" and "100" in the answer to use page variables and you're good to go.
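As a rough illustration of the paging idea, here is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in for the real database. The table name `report`, its columns, and the helpers `make_demo_db` and `fetch_page` are hypothetical. SQLite's LIMIT/OFFSET syntax is used here; on SQL Server 2012 or later the equivalent clause would be ORDER BY ... OFFSET ... ROWS FETCH NEXT ... ROWS ONLY.

```python
import sqlite3


def make_demo_db(n_rows=1000):
    """Build an in-memory table standing in for the large report table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE report (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO report (id, payload) VALUES (?, ?)",
        ((i, f"row-{i}") for i in range(1, n_rows + 1)),
    )
    conn.commit()
    return conn


def fetch_page(conn, page, page_size=100):
    """Return one page of rows; page numbers start at 1."""
    offset = (page - 1) * page_size
    cur = conn.execute(
        # A stable ORDER BY is required, otherwise pages may overlap.
        "SELECT id, payload FROM report ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cur.fetchall()
```

The page number and page size are passed as query parameters, which is what "use page variables" amounts to in practice.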

Here are three ideas. First, if you don't need nvarchar, switch these to varchar. That will halve the storage requirement and should make things go faster.
Second, be sure that the columns are declared no longer than nvarchar(4000)/varchar(8000). Anything larger has to be declared as nvarchar(max)/varchar(max), and large values of those types are stored on separate pages, increasing retrieval time.
Third, you don't say how you are retrieving the data. If you are bringing it back into another tool, such as Excel, or through ODBC, there may be other performance bottlenecks.
In the end, though, you are retrieving a large amount of data, so you should expect the time to be much longer than for retrieving just a handful of rows.

When you ask for all rows, you'll always get a scan.
400,000 rows X 112 columns X 17 bytes per column is 761,600,000 bytes. (I pulled 17 out of thin air.) Taking two minutes to move 3/4 of a gig across the network isn't bad. That's roughly the throughput of my server's scheduled backup to disk.
Do you have money for a faster network?
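The back-of-the-envelope estimate above can be reproduced in a few lines of Python. The 17 bytes per column is the answer's own guess, and the 120 seconds is the "almost 2 minutes" from the question rounded up; the function name is made up for this sketch.

```python
ROWS = 400_000
COLUMNS = 112
BYTES_PER_COLUMN = 17  # the answer's guess, not a measured value
SECONDS = 120          # "almost 2 minutes", rounded up


def estimate_transfer(rows, columns, bytes_per_column, seconds):
    """Return (total bytes, bytes per second) for a full-table fetch."""
    total_bytes = rows * columns * bytes_per_column
    return total_bytes, total_bytes / seconds


total, rate = estimate_transfer(ROWS, COLUMNS, BYTES_PER_COLUMN, SECONDS)
print(f"{total:,} bytes, {rate / 1e6:.1f} MB/s")
# prints: 761,600,000 bytes, 6.3 MB/s
```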

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, code smell and arguably a potentially less optimized design). The table has an 'id' as PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) in memory and then return only the first_name?
(In other words - the page that contains the row id = 10 if it isn't in the memory already)
I think the answer is yes! unless it has column markers within a row. I understand there might be optimization techniques, but is it a default behavior?
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with 100s of millions of rows with 100 columns each and receive all sorts of SELECT queries on them. The WHERE clause also changes but no incoming request needs all columns. Many of those cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases will load only the requested columns. Yes! Compression will help too to save space and hopefully performance as well.
For MySQL: Indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY needs to be accessed. For a million rows, for example, that is about 3 blocks. Within the leaf block, there are probably dozens of rows, with all their columns (unless a column is "too big"; but that is a different discussion).
For MariaDB's Columnstore: The contents of one column for 64K rows are held in a packed, compressed structure that varies in size and structure. Before getting to that, the clump of 64K rows must be located. After getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromise between speed and space for both simple and complex queries.
Your simple query is easy and efficient to do in a regular RDBMS, but messier to do in a Columnstore. Columnstore serves a niche market in which your query is abnormal.
Be aware that fetching blocks is typically the slowest part of performing the query, especially when I/O is required. There is a cache of blocks in RAM.
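To see where a figure like "3 blocks for a million rows" can come from, here is a small Python sketch. It assumes each B+Tree block points to roughly 100 entries; that fanout is a rule-of-thumb assumption for this sketch, not a number from the answer, and the real value depends on key and row sizes.

```python
def btree_levels(row_count, fanout=100):
    """Estimate how many B+Tree levels (blocks read per lookup) are needed.

    fanout is an assumed average number of entries per block.
    """
    levels = 1
    capacity = fanout  # entries reachable through a tree of this depth
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels


print(btree_levels(1_000_000))    # prints 3
print(btree_levels(100_000_000))  # prints 4
```

Under this assumption, even a table with hundreds of millions of rows needs only one or two more block reads per primary-key lookup than a table with a million rows.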

AWS Redshift column limit?

I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns increases in a table.
What doesn't make any sense here is that Redshift is supposed to be a column-store database, and there shouldn't in theory be an I/O hit from columns that are not selected in a particular where clause.
More specifically, when TableName is 1600 columns, I found that the below query is substantially slower than if TableName were, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'
My three questions are:
What's the deal? Why does Redshift have this limitation if it claims to be a column store?
Any suggestions for working around this limitation? Joins of multiple smaller tables seem to eventually approximate the performance of a single table. I haven't tried pivoting the data.
Does anyone have a suggestion for a fast, real-time performance, horizontally scalable column-store database that doesn't have the above limitations? All we're doing is count queries with simple where restrictions against approximately 10M (rows) x 2500 (columns) data.
I can't explain precisely why it slows down so much but I can verify that we've experienced the same thing.
I think part of the issue is that Redshift stores a minimum of 1MB per column per node. Having a lot of columns creates a lot of disk seek activity and I/O overhead.
1MB blocks are problematic because most of that will be empty space but it will still be read off of the disk.
Having lots of blocks means that column data will not be located as close together so Redshift has to do a lot more work to find them.
Also (this just occurred to me), I suspect that Redshift's MVCC controls add a lot of overhead. It tries to ensure you get a consistent read while your query is executing, and presumably that requires making a note of all the blocks for tables in your query, even blocks for columns that are not used. See the related question: Why is an implicit table lock being released prior to end of transaction in RedShift?
FWIW, our columns were virtually all BOOLEAN and we've had very good results from compacting them (bit masking) into INT/BIGINTs and accessing the values using the bit-wise functions. One example table went from 1400 cols (~200GB) to ~60 cols (~25GB) and the query times improved more than 10x (30-40 down to 1-2 secs).
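The compaction described above can be sketched in Python as follows. The function names and the flag layout (flag i lives in bit i) are assumptions for illustration; the answer does not say how its columns were actually assigned to bits. The 63-flag limit is a cautious choice that keeps the value inside a signed 64-bit BIGINT.

```python
def pack_flags(flags):
    """Pack a sequence of booleans into one integer; flag i becomes bit i."""
    if len(flags) > 63:
        # Stay clear of the sign bit of a signed 64-bit column.
        raise ValueError("at most 63 flags per BIGINT in this sketch")
    value = 0
    for position, flag in enumerate(flags):
        if flag:
            value |= 1 << position
    return value


def flag_is_set(value, position):
    """Test a single flag with a shift and a bitwise AND."""
    return bool((value >> position) & 1)


packed = pack_flags([True, False, True])
print(packed)                  # prints 5
print(flag_is_set(packed, 2))  # prints True
```

In the database, the same test would be written with the bitwise operators of the SQL dialect in use, so dozens of BOOLEAN columns collapse into a single integer column.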

What's the curve for a simple select query?

This is a conceptual question.
Hypothetically, when I do select * from table_name where the table has 1 million records, it takes about 3 secs.
Similarly, when I select 10 million records the time taken is about 30 secs. But I am told the time taken is not linearly proportional to the number of records selected. After a certain number, does the time required to select records increase exponentially?
Please help me understand how this works.
There are things that can make one query take longer than another, even for simple selects with no where clauses or joins.
First, the time to return the query depends on how busy the network is at the time the query is run. It could also depend on whether there are any locks on the data or how much memory is available.
It also depends on how wide the tables are and, in general, how many bytes an individual record has. For instance, I would expect a 10 million record table that has only two columns, both ints, to return much faster than a million record table that has 50 columns including some large columns, especially if they are things like documents stored as database objects or large fields that have too much text to fit into an ordinary varchar or nvarchar field (in SQL Server these would be nvarchar(max) or text, for instance). I would expect this because there is simply less total data to return even though there are more records.
As you start adding where clauses and joins, of course, there are many more things that affect the performance of an individual query. If you query databases, you should read a good book on performance tuning for your particular database. There are many things you can do without realizing it that can cause queries to run more slowly than need be. You should learn the techniques that create the queries most likely to be performant.
I think this is different for each database-server. Try to monitor the performance while you fire your queries (what happens to the memory, and CPU?)
Eventually all hardware components have a bottleneck. If you come close to that point the server might 'suffocate'.

Optimal row size to fetch at a time from a big table

I have a very big table containing around 20 million rows. I have to fetch some 4 million rows from this table based on some filtering criteria.
All the columns in the filtering criteria are covered by some index, and the table stats are up to date.
It has been suggested to me that instead of loading all rows in a single go, I should use a batch size, say 80000 rows at a time, and that this will be faster than loading all the rows at once.
Can you suggest if this idea makes sense?
If it makes sense, what would be the optimal number of rows to load at a time?
It can be much faster than a single SQL statement.
Split the data using the PK.
Batch size: it depends on the row length and the processing time. Start with 10 000.
Use threads for the job if possible.
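One way to split the work by primary key is sketched below in Python, with the standard-library sqlite3 module standing in for the real database. Each batch resumes after the last key already seen. The table `big_table`, the `wanted` filter column and both helper functions are hypothetical, and the sketch assumes positive integer keys.

```python
import sqlite3


def make_big_table(n_rows):
    """Build a small in-memory stand-in; even ids match the filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE big_table "
        "(id INTEGER PRIMARY KEY, wanted INTEGER, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO big_table (id, wanted, payload) VALUES (?, ?, ?)",
        ((i, 1 if i % 2 == 0 else 0, f"row-{i}") for i in range(1, n_rows + 1)),
    )
    conn.commit()
    return conn


def fetch_in_batches(conn, batch_size=10_000):
    """Yield lists of matching rows, walking the primary key in order."""
    last_id = 0  # assumes keys are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM big_table "
            "WHERE id > ? AND wanted = 1 ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last key seen
```

Because each batch seeks directly to `id > last_id`, later batches do not rescan rows that were already returned. Whether this beats a single statement depends on the client and the processing done per batch, so the batch size is worth measuring.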
Use SSIS to manipulate your data...it does everything you are wanting like threading and optimizations on load sizing and cache.
Spin up a cube or look into Business Intelligence Data Warehouse Tools...

Performance of returning entire tables containing blog text as opposed to selecting specific columns

I think this is a pretty common scenario: I have a webpage that's returning links and excerpts to the 10 most recent blog entries.
If I just queried the entire table, I could use my ORM mapped object, but I'd be downloading all the blog text.
If I restricted the query to just the columns that I need, I'd be defining another class that'll hold just those required fields.
How bad is the performance hit if I were to query entire rows? Is it worth selecting just what I need?
The answer is "it depends".
There are two things that affect performance as far as column selection:
Are there covering indexes? E.g. if there is an index containing ALL of the columns in the smaller query, then a smaller column set would be extremely beneficial performance-wise, since the index would be read without reading any rows themselves.
Size of columns. Basically, count how big the size of the entire row is, vs. size of only the columns in smaller query.
If the ratio is significant (e.g. full row is 3x bigger), then you might have significant savings in both IO (for retrieval) and network (for transmission) cost.
If the ratio is more like 10% benefit, it might not be worth it as far as DB performance gain.
It depends, but it will never be as efficient as returning only the columns you need (obviously). If there are few rows and the row sizes are small, then network bandwidth won't be affected too badly.
But, returning only the columns you need increases the chance that there is a covering index that can be used to satisfy the query, and that can make a big difference in the time a query takes to execute.
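The covering-index effect can be observed with a small Python sketch that uses the standard-library sqlite3 module. The `posts` table, its columns and the index name are invented for this example, and SQLite's planner is only a stand-in: other databases report their plans differently.

```python
import sqlite3


def build_blog_db():
    """Create a hypothetical posts table with an index on the listing columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, created TEXT, "
        "title TEXT, excerpt TEXT, body TEXT)"
    )
    conn.execute(
        "CREATE INDEX idx_posts_recent ON posts (created, title, excerpt)"
    )
    return conn


def query_plan(conn, sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


conn = build_blog_db()
narrow = "SELECT title, excerpt FROM posts ORDER BY created DESC LIMIT 10"
wide = "SELECT * FROM posts ORDER BY created DESC LIMIT 10"
print(query_plan(conn, narrow))  # should report a covering index
print(query_plan(conn, wide))    # has to visit the table for the body column
```

The narrow query can be answered from the index alone, while the wide query has to read each full row, including the blog text.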
Since you specify that it's for 10 records, the answer changes from "It Depends" to "Don't spend even a second worrying about this".
Unless your server is in another country on a dialup connection, wire time for 10 records will be zero, regardless of how many bytes you shave off each row. It's simply not something worth optimizing for.
So for this case, you get to set your ORM free to grab you those records in the least efficient manner it can come up with. If your situation changes, and you suddenly need more than, say, 1000 records at once, then you can come back and we'll make fun of you for not specifying columns, but for now you get a free pass.
For extra credit, once you start issuing this homepage query more than 10x per second, you can add caching on the server to avoid repeatedly hitting the database. That'll get you a lot more bang for your buck than optimizing the query.
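A server-side cache of this kind can be as small as the following Python sketch. The class name, the 60-second lifetime and the `loader` callback are all assumptions for illustration; a real application would more likely use whatever cache its web framework already provides.

```python
import time


class TTLCache:
    """Cache a single query result and refresh it after ttl_seconds."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # function that actually hits the database
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: run the query once and remember it.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

With a cache like this in front of the homepage query, ten requests per second result in roughly one database call per minute instead of six hundred.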