I have data which is a matrix of integer values that represent a banded distribution curve.
I'm optimizing for SELECT performance over INSERT performance. There are at most 100 bands.
I'll primarily be querying this data by summing or averaging bands across a period of time.
My question is: can I achieve better performance by flattening this data across a table with one column for each band, or by using a single column representing the band value?
Flattened data
UserId  ActivityId  DateValue  Band1  Band2  Band3  ...  Band100
10001   10002       1/1/2013   1      5      100    ...  200
OR Normalized
UserId  ActivityId  DateValue  Band  BandValue
10001   10002       1/1/2013   1     1
10001   10002       1/1/2013   2     5
10001   10002       1/1/2013   3     100
Sample query
SELECT AVG(Band1), AVG(Band2), AVG(Band3), ... AVG(Band100)
FROM ActivityBands
WHERE DateValue > '1/1/2012' AND DateValue < '1/1/2013'
GROUP BY UserId
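For comparison, the equivalent aggregation against the normalized layout would look something like the sketch below (column names taken from the normalized example; the reshaping back to one column per band would then have to happen client-side or with CASE expressions):

SELECT UserId, Band, AVG(BandValue) AS AvgBandValue
FROM ActivityBands   -- the normalized variant of the table
WHERE DateValue > '1/1/2012' AND DateValue < '1/1/2013'
GROUP BY UserId, Band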
Store the data in the normalized format.
If you are not getting acceptable performance from this scheme, instead of denormalizing, first consider what indexes you have on the table; you're likely missing an index that would make this perform similarly to the denormalized table. Next, try writing a query that retrieves data from the normalized table so that the result set looks like the denormalized table, and use that query to create an indexed view. This will give you select performance identical to the denormalized table, while retaining the data-organization benefits of proper normalization.
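A rough sketch of that approach, assuming the normalized variant of the table is dbo.ActivityBands with the columns shown in the question; the index and view names here are made up, and indexed views come with additional restrictions (SCHEMABINDING, COUNT_BIG with GROUP BY, SUM only over non-nullable expressions, and so on):

-- Covering index for the date-range, per-user aggregation on the normalized table:
CREATE INDEX IX_ActivityBands_User_Date
    ON dbo.ActivityBands (UserId, DateValue)
    INCLUDE (Band, BandValue);
GO

-- Pivoting view that reshapes the normalized rows into one wide row per key;
-- a unique clustered index on it turns it into an indexed view:
CREATE VIEW dbo.vActivityBandsWide
WITH SCHEMABINDING
AS
SELECT UserId, ActivityId, DateValue,
       SUM(CASE WHEN Band = 1 THEN BandValue ELSE 0 END) AS Band1,
       SUM(CASE WHEN Band = 2 THEN BandValue ELSE 0 END) AS Band2,
       -- ...repeat through Band100...
       COUNT_BIG(*) AS RowCnt    -- required when an indexed view uses GROUP BY
FROM dbo.ActivityBands
GROUP BY UserId, ActivityId, DateValue;
GO

CREATE UNIQUE CLUSTERED INDEX IX_vActivityBandsWide
    ON dbo.vActivityBandsWide (UserId, ActivityId, DateValue);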
Denormalization optimizes exactly one means of accessing the data, at the expense of (almost all) others.
If you have only one access method that is performance critical, denormalization is likely to help; though proper index selection is of greater benefit. However, if you have multiple performance critical access paths to the data, you are better to seek other optimizations.
Creating an appropriate clustered index, putting your non-clustered indexes on SSDs, and increasing memory on your server are all techniques that will improve performance for all accesses, rather than trading off between various accesses.
If you are accessing all (or most) of the bands in each row, then the denormalized form is better. Much better in my experience.
The reason is simple: the size of the data in the pages is much smaller, so many fewer pages need to be read to satisfy the query. The overhead of storing one band per row is about 4 integers, or 32 bytes, so 100 bands come to about 3,200 bytes. Within a single denormalized record, the record size is 100*4 + 8, or about 408 bytes. If your query reads a significant number of records, this reduces the I/O requirements significantly.
There is a caveat. If you are only reading one record's worth of data, then the 100 normalized rows fit on a single page and the one denormalized record also fits on a single page, so the I/O for a single page read could be identical in the two cases. The benefit arises as you read more and more data.
Your sample query is reading hundreds or thousands of rows, so denormalization should benefit such a query.
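To put rough numbers on that: averaging a year of data for 10,000 (user, activity, date) combinations touches roughly 10,000 * 3,200 bytes ~ 32 MB of row data in the normalized layout, versus roughly 10,000 * 408 bytes ~ 4 MB in the denormalized one.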
If you want to fetch data very fast, then you should flatten the table and use indexes to improve selecting over a broad column range, similar to what you have proposed. However, if you are more interested in quick updates, then third or fourth normal form in combination with the necessary table joins should offer better performance.
Related
I have a table with 100 columns (yes, a code smell and arguably a less-than-optimal design). The table has an 'id' as PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) in memory and then return only the first_name?
(In other words - the page that contains the row id = 10 if it isn't in the memory already)
I think the answer is yes, unless it keeps column markers within a row. I understand there might be optimization techniques, but is that the default behavior?
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with 100s of millions of rows with 100 columns each and receive all sorts of SELECT queries on them. The WHERE clause also changes but no incoming request needs all columns. Many of those cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that a column-oriented database will load only the requested columns, and compression will help save space and hopefully improve performance as well.
For MySQL: Indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY (in your case) needs to be accessed. For example, with a million rows that is about 3 levels deep, hence 3 blocks. Within the leaf block, there are probably dozens of rows, with all their columns (unless a column is "too big"; but that is a different discussion).
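A quick way to confirm this on the MySQL side is to run EXPLAIN on the query from the question; a primary-key equality lookup like this shows up with access type const, i.e. a single-row lookup:

EXPLAIN SELECT first_name FROM EMP WHERE id = 10;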
For MariaDB's Columnstore: The contents of one column for 64K rows are held in a packed, compressed structure that varies in size and structure. Before getting to that, the clump of 64K rows must be located; after getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromise between speed and space, for both simple and complex queries.
Your simple query is easy and efficient for a regular RDBMS to do, but messier to do in a Columnstore. Columnstore is a niche market in which your query is abnormal.
Be aware that fetching blocks is typically the slowest part of performing a query, especially when I/O is required. There is a cache of blocks in RAM.
Simplified example: Two tables - people and times. Goal is to keep track of all the times a person walks through a doorway.
A person could have between 0 and 50 entries in the times table daily.
What is the proper and most efficient way to keep track of these records? Is it
times table
-----------
person_id
timestamp
I'm worried that this table can get well over a million records rather quickly. Insertion and retrieval times are of utmost importance.
ALSO: It's obviously non-normalized, but would it be a better idea to do
times table
-----------
person_id
serialized_timestamps_for_the_day
date
We need to access each individual timestamp for the person, but we will ONLY query records by date or by the person's id.
The second solution has some problems:
Since you need to access individual timestamps [1], serialized_timestamps_for_the_day cannot be considered atomic and would violate 1NF, causing a bunch of problems.
On top of that, you are introducing redundancy: the date can be inferred from the contents of serialized_timestamps_for_the_day, and your application code would need to make sure the two never become "desynchronized", which is vulnerable to bugs. [2]
Therefore, go with the first solution. If properly indexed, a modern database on modern hardware can handle much more than a mere "well over a million records". In this specific case:
A composite index on {person_id, timestamp} will allow you to query for a person, or for a combination of person and date, with a simple index range scan, which can be very efficient.
If you need just the "by date" query, you'll need an index on {timestamp}. You can easily search for all timestamps within a specific date by searching for the range 00:00 to 24:00 of the given day.
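A minimal sketch of the first option with both indexes (generic SQL; I've named the time column event_time to avoid the reserved word TIMESTAMP, and the exact date/time type depends on your RDBMS):

CREATE TABLE times (
    person_id  INT      NOT NULL,
    event_time DATETIME NOT NULL
);

-- person, or person + date range, via an index range scan:
CREATE INDEX ix_times_person_time ON times (person_id, event_time);

-- "by date" queries across all people:
CREATE INDEX ix_times_time ON times (event_time);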
[1] Even if you don't query for individual timestamps, you still need to write them to the database one by one. If you have a serialized field, you first need to read the whole field just to append one value, and then write the whole result back to the database, which may become a performance problem rather quickly. And there are other problems, as mentioned in the link above.
[2] As a general rule, what can be inferred should not be stored, unless there is a good performance reason to do so, and I don't see one here.
Consider what we are talking about here. Accounting for just the raw data (event_time, user_id), this would be (4 + 4) * 1M ~ 8 MB per 1M rows. Let's try to roughly estimate this in a DB.
One integer is 4 bytes and a timestamp is 4 bytes; add a row header of, say, 18 bytes, and the first estimate of the row size is 4 + 4 + 18 = 26 bytes. Using a page fill factor of about 0.7 gives 26 / 0.7 ~ 37 bytes per row.
So, for 1M rows that would be about 37 MB. You will need an index on (user_id, event_time), so let's simply double the original: 37 * 2 = 74 MB.
This brings the very rough, inaccurate estimate to 74 MB per 1M rows.
So, to keep this in memory all the time, you would need 0.074 GB for each 1M rows of this table.
To get a better estimate, simply create a table, add the index, and fill it with a few million rows.
Given the expected data volume, this can all easily be tested with 10M rows even on a laptop -- testing always beats speculating.
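For instance, assuming a times (user_id, event_time) table matching the estimate above, a throwaway test population could look like this (SQL Server syntax; the row count and constants are arbitrary):

-- generate 10 million synthetic rows spread over one year
;WITH n AS (
    SELECT TOP (10000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
    CROSS JOIN sys.all_objects c
)
INSERT INTO times (user_id, event_time)
SELECT rn % 100000,                                                -- ~100,000 distinct users
       DATEADD(SECOND, CAST(rn % 31536000 AS int), '2013-01-01')   -- some second within one year
FROM n;

CREATE INDEX ix_times_user_time ON times (user_id, event_time);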
P.S. Your option 2 does not look like an "obviously better idea" to me at all.
I think the first option would be the better one.
Even if you go for the second option, the size of the index might not be reduced; in fact, there will be an additional column.
And since the data for different users is not related, you can shard the database based on person_id. That is, say your data cannot fit on a single database server node and requires two nodes: then data for half the users will be stored on one node and the rest of the data on the other node.
This can be done with an RDBMS like MySQL, or with document-oriented databases like MongoDB and OrientDB as well.
I have a single large denormalized table that mirrors the makeup of a fixed-length flat file that is loaded yearly: 112 columns and 400,000 records. I have a unique clustered index on the 3 columns that make up the WHERE clause of the query that is run most often against this table. Index fragmentation is .01. Performance on that query is good, sub-second. However, returning all the records takes almost 2 minutes. The execution plan shows 100% of the cost is on a Clustered Index Scan (not a seek).
There are no queries that require a join (due to the denorm). The table is used for reporting. All fields are type nvarchar (of the length of the field in the data file).
Beyond normalizing the table, what else can I do to improve performance?
Try paginating the query. You can split the results into, let's say, groups of 100 rows. That way, your users will see the results pretty quickly. Also, if they don't need to see all the data every time they view the results, it will greatly cut down the amount of data retrieved.
Beyond this, adding parameters to the query that filter the data will reduce the amount of data returned.
This post is a good way to get started with pagination: SQL Pagination Query with order by
Just replace the "50" and "100" in the answer to use page variables and you're good to go.
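A rough sketch of the ROW_NUMBER() pattern that post describes (the table name, the key columns, and the page variables below are placeholders; order by the three columns of your clustered index so the pages stay stable):

DECLARE @PageStart int = 1, @PageEnd int = 100;   -- page variables

SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY KeyCol1, KeyCol2, KeyCol3) AS rn
    FROM dbo.FlatFileReport t
) AS paged
WHERE rn BETWEEN @PageStart AND @PageEnd
ORDER BY rn;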
Here are three ideas. First, if you don't need nvarchar, switch these to varchar. That will halve the storage requirement and should make things go faster.
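For example, the switch for one column would look like this (a sketch only; the table and column names are placeholders, and you would keep each column's original length and nullability):

ALTER TABLE dbo.FlatFileReport ALTER COLUMN SomeField varchar(50) NULL;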
Second, be sure that the lengths of the fields are less than nvarchar(4000)/varchar(8000). Anything larger causes the values to be stored on a separate page, increasing retrieval time.
Third, you don't say how you are retrieving the data. If you are bringing it back into another tool, such as Excel, or through ODBC, there may be other performance bottlenecks.
In the end, though, you are retrieving a large amount of data, so you should expect the time to be much longer than for retrieving just a handful of rows.
When you ask for all rows, you'll always get a scan.
400,000 rows X 112 columns X 17 bytes per column is 761,600,000 bytes. (I pulled 17 out of thin air.) Taking two minutes to move 3/4 of a gig across the network isn't bad. That's roughly the throughput of my server's scheduled backup to disk.
Do you have money for a faster network?
I built an analytics engine that pulls 50-100 rows of raw data from my database (let's call it raw_table), runs a bunch of statistical measurements on it in PHP, and then comes up with exactly 140 datapoints that I then need to store in another table (let's call it results_table). All of these data points are very small ints ("40", "2.23", "-1024" are good examples of the types of data).
I know the maximum number of columns for MySQL is quite high (4000+), but there appears to be a lot of grey area as far as when performance really starts to degrade.
So a few questions here on best performance practices:
1) If fewer columns is better, the 140 datapoints could be broken up into 20 rows of 7 datapoints, all with the same 'experiment_id'. HOWEVER, I would always need to pull ALL 20 rows (with 7 columns each, plus id, etc.), so I wouldn't think this would perform better than pulling 1 row of 140 columns. So the question: is it better to store 20 rows of 7-9 columns (that would all need to be pulled at once) or 1 row of 140-143 columns?
2) Given my data examples ("40", "2.23", "-1024" are good examples of what will be stored), I'm thinking smallint for the column type. Any feedback there, performance-wise or otherwise?
3) Any other feedback on mysql performance issues or tips is welcome.
Thanks in advance for your input.
I think the advantage of storing as more rows (i.e. normalized) depends on design and maintenance considerations in the face of change.
Also consider whether the 140 columns always have the same meaning, or whether the meaning differs per experiment; model the data properly according to normalization rules, i.e. how the data relates to a candidate key.
As far as performance, if all the columns are used it makes very little difference. Sometimes a pivot/unpivot operation can be expensive over a large amount of data, but it makes little difference on a single key access pattern. Sometimes a pivot in the database can make your frontend code a lot simpler and backend code more flexible in the face of change.
If you have a lot of NULLs, it might be possible to eliminate rows in a normalized design and this would save space. I don't know if MySQL has support for a sparse table concept, which could come into play there.
You have 140 data items to return every time, each of type double.
It makes no practical difference whether this is 1x140, 20x7, 7x20, or 4x35, etc. It could be infinitesimally quicker for one shape, of course, but have you considered the extra complexity in the PHP code to deal with a different shape?
Do you have a verified bottleneck, or is this just random premature optimisation?
You've made no suggestion that you intend to store big data in the database, but for the purposes of this argument, I will assume that you have 1 billion (10^9) data points.
If you store them in 140 columns, you'll have a mere 7 million rows; however, if you want to retrieve a single data point from lots of experiments, the database will have to fetch a large number of very wide rows.
These very wide rows will take up more space in your innodb_buffer_pool, hence you won't be able to cache so many; this will potentially slow you down when you access them again.
If you store one datapoint per row, in a table with very few columns (experiment_id, datapoint_id, value) then you'll need to pull out the same number of smaller rows.
However, the size of rows makes little difference to the number of IO operations required. If we assume that your 1 billion datapoints doesn't fit in ram (which is NOT a safe assumption nowadays), maybe the resulting performance will be approximately the same.
Using few columns is probably the better database design; but using lots of columns will take less disc space and will perhaps be faster to populate.
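For what it's worth, a sketch of the one-datapoint-per-row table described above might look like this (MySQL; the table name and the DECIMAL precision are placeholders chosen to cover examples like "2.23" and "-1024"):

CREATE TABLE results_narrow (
    experiment_id INT UNSIGNED      NOT NULL,
    datapoint_id  SMALLINT UNSIGNED NOT NULL,   -- 1..140
    value         DECIMAL(10,2)     NOT NULL,
    PRIMARY KEY (experiment_id, datapoint_id)
) ENGINE=InnoDB;

-- all 140 datapoints for one experiment come back as 140 small rows:
SELECT datapoint_id, value FROM results_narrow WHERE experiment_id = 42;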
I am working on a web API for the insurance industry and trying to work out a suitable data structure for the quoting of insurance.
The database already contains a "ratings" table which is basically:
sysID (PK, INT IDENTITY)
goods_type (VARCHAR(16))
suminsured_min (DECIMAL(9,2))
suminsured_max (DECIMAL(9,2))
percent_premium (DECIMAL(9,6))
[Unique Index on goods_type, suminsured_min and suminsured_max]
[edit]
Each type of goods typically has 3 - 4 ranges for suminsured
[/edit]
The list of goods_types rarely changes and most queries for insurance will involve goods worth less than $100. Because of this, I was considering de-normalising using tables in the following format (for all values from $0.00 through to $100.00):
Table Name: tblRates[goodstype]
suminsured (DECIMAL(9,2)) Primary Key
premium (DECIMAL(9,2))
Denormalising this data should be easy to maintain as the rates are generally only updated once per month at most. All requests for values >$100 will always be looked up in the primary tables and calculated.
My question(s) are:
1. Am I better off storing the suminsured values as DECIMAL(9,2) or as a value in cents stored in a BIGINT?
2. This de-normalisation method involves storing 10,001 values ($0.00 to $100.00 in $0.01 increments) in possibly 20 tables. Is this likely to be more efficient than looking up the percent_premium and performing a calculation? - Or should I stick with the main tables and do the calculation?
Don't create new tables. You already have an index on goods_type, min and max values, so this SQL (for a known goods type and its value):
SELECT percent_premium
FROM ratings
WHERE goods_type='PRECIOUST' and :PREC_VALUE BETWEEN suminsured_min AND suminsured_max
will use your index efficiently.
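If you like, the lookup and the premium calculation can even be combined into one statement (a sketch; whether percent_premium needs dividing by 100 depends on whether it is stored as a percentage or as a fraction):

SELECT :PREC_VALUE * percent_premium / 100 AS premium
FROM ratings
WHERE goods_type = 'PRECIOUST'
  AND :PREC_VALUE BETWEEN suminsured_min AND suminsured_max;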
The data type you are looking for is smallmoney. Use it.
The plan you suggest will use a binary search on 10001 rows instead of 3 or 4.
It's hardly a performance improvement, don't do that.
As for the arithmetic, BIGINT will be slightly faster, though I think you will hardly notice that.
I am not entirely sure exactly what calculations we are talking about, but unless they are obnoxiously complicated, they will more than likely be much quicker than looking up data in several different tables. If possible, perform the calculations in the DB (i.e. use stored procedures) to minimize the data traffic between your application layers too.
And even if the data loading would be quicker, I think the idea of having to update de-normalized data as often as once a month (or even once a quarter) is pretty scary. You can probably do the job quickly enough, but what about the next person handling the system? Would you require them to learn the DB structure, remember which of the 20-some tables need to be updated each time, and do it correctly? I would say the possible performance gain from de-normalizing is not worth the risk of contaminating the data with incorrect information.