200 column table - 3 million rows - performance - sql

i'm currently working on a project where the client has handed me a database that includes a table with over 200 columns and 3 million rows of data lol. This is definitely poorly designed and currently exploring some options. I developed the app on my 2012 mbp with 16gb of ram and an 512 ssd. I had to develop the app using mvc4 so set up the development and test environment using parallels 8 on osx. As part of the design, I developed an interface for the client to create custom queries to this large table with hundreds of rows so I am sending a queryString to the controller which is passed using dynamic linq and the results are sent to the view using JSON (to populate a kendo ui grid). On my mbp, when testing queries using the interface i created it takes max 10 secs (which find too much) to return the results to my kendo ui grid. Similarly, when I test queries directly in sql server, it never takes really long.
However when I deployed this to the client for testing these same queries take in excess of 3 mins +. So long story short, the client will be upgrading the server hardware but in the mean time they still need to test the app.
My question is, despite the fact that the table holds 200 columns, each row is unique. More specifically, the design is:
PK-(GUID) OrganizationID (FK) --- 200 columns (tax fields)
If I redesign this to:
PK (GUID) OrganizationID (FK) FieldID(FK) Input
Field table:
FieldID FieldName
This would turn this 3 million rows of data table into 600 million rows but only 3 columns. Will I see performance enhancements?
Any insight would be appreciated - I understand normalization but most of my experience is in programming.
Thanks in advance!

It is very hard to make any judgements without knowing the queries that you are running on the table.
Here are some considerations:
Be sure that the queries are using indexes if they are returning only a handful of rows.
Check that you have enough memory to store the table in memory.
When doing timings, be sure to ignore the first run, because this is just loading the page cache.
For testing purposes, just reduce the size of the table. That should speed things up.
As for your question about normalization. Your denormalized structure takes up much less disk space than a normalized structure, because you do not need to repeat the keys for each value. If you are looking for one value on one row, normalization will not help you. You will still need to scan the index to find the row and then load the row. And, the row will be on one page, regardless of whether it is normalized or denormalized. In fact, normalization might be worse, because the index will be much larger.
There are some examples of queries where normalizing the data will help. But, in general, you already have a more efficient data structure if you are fetching the data by rows.

You can take a paging approach. There will be 2 queries: initial will return all rows but only a column with unique IDs. This array can be split into pages, say 100 IDs per page. When user selects a specific page - you pass 100 ids to the second query which this time will return all 200 columns but only for requested 100 rows. This way you don't have to return all the columns across all the rows at once, which should yield significant performance boost.

Related

Performance Issue : Sql Views vs Linq For Decreasing Query Execution Time

I am having a System Setup in ASP.NET Webforms and there is Acccounts Records Generation Form In Some Specific Situation I need to Fetch All Records that are near to 1 Million .
One solution could be to reduce number of records to fetch but when we need to fetch records for more than a year of 5 years that time records are half million, 1 million etc. How can I decrease its time?
What could be points that I can use to reduce its time? I can't show full query here, it's a big view that calls some other views in it
Does it take less time if I design it in as a Linq query? That's why I asked Linq vs Views
I have executed a "Select * from TableName" Query and its 40 mins and its still executing table is having 1,17,000 Records Can we decrease this timeline
I started this as a comment but ran out of room.
Use the server to do as much filtering for you as possible and return as few rows as possible. Client side filtering is always going to be much slower than server side filtering. Eg, it does not have access to the indexes & optimisation techniques that exist on the server.
Linq uses "lazy evaluation" which means that it builds up a method for filtering but does not execute it until it is forced to. I've used it and was initially impressed with the speed ... until I started to access the data it returned. When you use the data you want from Linq, this will trigger the actual selection process, which you'll find is slow.
Use the server to return a series of small resultsets and then process those. If you need to join these resultsets on a key, save them into dictionaries with that key so you can join them quickly.
Another approach is to look at Entity Framework to create a mirror of the server database structure along with indexes so that the subset of data you retrieve can be joined quickly.

Large Denormalized Table Optimization

I have a single large denormalized table that mirrors the make up of a fixed length flat file that is loaded yearly. 112 columns and 400,000 records. I have a unique clustered index on the 3 columns that make up the where clause of the query that is run most against this table. Index Frag is .01. Performance on the query is good, sub second. However, returning all the records takes almost 2 minutes. The execution plan shows 100% of the cost is on a Clustered Index Scan (not seek).
There are no queries that require a join (due to the denorm). The table is used for reporting. All fields are type nvarchar (of the length of the field in the data file).
Beyond normalizing the table. What else can I do to improve performance.
Try paginating the query. You can split the results into, let's say, groups of 100 rows. That way, your users will see the results pretty quickly. Also, if they don't need to see all the data every time they view the results, it will greatly cut down the amount of data retrieved.
Beyond this, adding parameters to the query that filter the data will reduce the amount of data returned.
This post is a good way to get started with pagination: SQL Pagination Query with order by
Just replace the "50" and "100" in the answer to use page variables and you're good to go.
Here are three ideas. First, if you don't need nvarchar, switch these to varchar. That will halve the storage requirement and should make things go faster.
Second, be sure that the lengths of the fields are less than nvarchar(4000)/varchar(8000). Anything larger causes the values to be stored on a separate page, increasing retrieval time.
Third, you don't say how you are retrieving the data. If you are bringing it back into another tool, such as Excel, or through ODBC, there may be other performance bottlenecks.
In the end, though, you are retrieving a large amount of data, so you should expect the time to be much longer than for retrieving just a handful of rows.
When you ask for all rows, you'll always get a scan.
400,000 rows X 112 columns X 17 bytes per column is 761,600,000 bytes. (I pulled 17 out of thin air.) Taking two minutes to move 3/4 of a gig across the network isn't bad. That's roughly the throughput of my server's scheduled backup to disk.
Do you have money for a faster network?

Improving query performance of of database table with large number of columns and rows(50 columns, 5mm rows)

We are building an caching solution for our user data. The data is currently stored i sybase and is distributed across 5 - 6 tables but query service built on top of it using hibernate and we are getting a very poor performance. In order to load the data into the cache it would take in the range of 10 - 15 hours.
So we have decided to create a denormalized table of 50 - 60 columns and 5mm rows into another relational database (UDB), populate that table first and then populate the cache from the new denormalized table using JDBC so the time to build us cache is lower. This gives us a lot better performance and now we can build the cache in around an hour but this also does not meet our requirement of building the cache whithin 5 mins. The denormlized table is queried using the following query
select * from users where user id in (...)
Here user id is the primary key. We also tried a query
select * from user where user_location in (...)
and created a non unique index on location also but that also did not help.
So is there a way we can make the queries faster. If not then we are also open to consider some NOSQL solutions.
Which NOSQL solution would be suited for our needs. Apart from the large table we would be making around 1mm updates on the table on a daily basis.
I have read about mongo db and seems that it might work but no one has posted any experience with mongo db with so many rows and so many daily updates.
Please let us know your thoughts.
The short answer here, relating to MongoDB, is yes - it can be used in this way to create a denormalized cache in front of an RDBMS. Others have used MongoDB to store datasets of similar (and larger) sizes to the one you described, and can keep a dataset of that size in RAM. There are some details missing here in terms of your data, but it is certainly not beyond the capabilities of MongoDB and is one of the more frequently used implementations:
http://www.mongodb.org/display/DOCS/The+Database+and+Caching
The key will be the size of your working data set and therefore your available RAM (MongoDB maps data into memory). For larger solutions, write heavy scaling, and similar issues, there are numerous approaches (sharding, replica sets) that can be employed.
With the level of detail given it is hard to say for certain that MongoDB will meet all of your requirements, but given that others have already done similar implementations and based on the information given there is no reason it will not work either.

mysql - Creating rows vs. columns performance

I built an analytics engine that pulls 50-100 rows of raw data from my database (lets call it raw_table), runs a bunch statistical measurements on it in PHP and then comes up with exactly 140 datapoints that I then need to store in another table (lets call it results_table). All of these data points are very small ints ("40","2.23","-1024" are good examples of the types of data).
I know the maximum # of columns for mysql is quite high (4000+) but there appears to be a lot of grey area as far as when performance really starts to degrade.
So a few questions here on best performance practices:
1) The 140 datapoints could be, if it is better, broken up into 20 rows of 7 data points all with the same 'experiment_id' if fewer columns is better. HOWEVER I would always need to pull ALL 20 rows (with 7 columns each, plus id, etc) so I wouldn't think this would be better performance than pulling 1 row of 140 columns. So the question: is it better to store 20 rows of 7-9 columns (that would all need to be pulled at once) or 1 row of 140-143 columns?
2) Given my data examples ("40","2.23","-1024" are good examples of what will be stored) I'm thinking smallint for the structure type. Any feedback there, performance-wise or otherwise?
3) Any other feedback on mysql performance issues or tips is welcome.
Thanks in advance for your input.
I think the advantage to storing as more rows (i.e. normalized) depends on design and maintenance considerations in the face of change.
Also, if the 140 columns have the same meaning or if it differs per experiment - properly modeling the data according to normalization rules - i.e. how is data related to a candidate key.
As far as performance, if all the columns are used it makes very little difference. Sometimes a pivot/unpivot operation can be expensive over a large amount of data, but it makes little difference on a single key access pattern. Sometimes a pivot in the database can make your frontend code a lot simpler and backend code more flexible in the face of change.
If you have a lot of NULLs, it might be possible to eliminate rows in a normalized design and this would save space. I don't know if MySQL has support for a sparse table concept, which could come into play there.
You have a 140 data items to return every time, each of type double.
It makes no practical difference whether this is 1x140 or 20x7 or 7x20 or 4x35 etc. It could be infinitesimally quicker for one shape of course but then have you considered the extra complexity in the PHP code to deal with a different shape.
Do you have a verified bottleneck, or is this just random premature optimisation?
You've made no suggestion that you intend to store big data in the database, but for the purposes of this argument, I will assume that you have 1 billion (10^9) data points.
If you store them in 140 columns, you'll have a mere 7 millon rows, however, if you want to retrieve a single data point from lots of experiments, then it will have to fetch a large number of very wide rows.
These very wide rows will take up more space in your innodb_buffer_pool, hence you won't be able to cache so many; this will potentially slow you down when you access them again.
If you store one datapoint per row, in a table with very few columns (experiment_id, datapoint_id, value) then you'll need to pull out the same number of smaller rows.
However, the size of rows makes little difference to the number of IO operations required. If we assume that your 1 billion datapoints doesn't fit in ram (which is NOT a safe assumption nowadays), maybe the resulting performance will be approximately the same.
It is probably better database design to use few columns; but it will use less disc space and perhaps be faster to populate, if you use lots of columns.

Handling 100's of 1,000,000's of rows in T-SQL2005

I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the to older DBs to be stored in one table. This table has only a primary key, a foreign key (both int's), a datetime and a decimal field, but adding the count of rows from the two older DBs indicates that the total row count for this new table would be about 200,000,000 rows.
How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.
I've got ideas based around having multiple tables for year, supplier (of the source data) etc - or even having one database for each year, with the most recent 2 years in one DB (which would also contain the stored procs for managing all this.)
Any and all help, ideas, suggestions very, deeply, much appreciated,
Matt.
Most importantly. consider profiling your queries and measuring where your actual bottlenecks are (try identifying the missing indexes), you might see that you can store everything in a single table, or that buying a few extra hard disks will be enough to get sufficient performance.
Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition with the 1% commonly accessed and another with the 99% of the data.
This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.
On the other hand, it might make more sense to actually splitting the tables in 'current' and 'historical'.
Another possible size improvement is using an int (like an epoch) instead of a datetime and provide functions to convert from datetime to int, thus having queries like
SELECT * FROM megaTable WHERE datetime > dateTimeToEpoch('2010-01-23')
This size savings will probably have a cost performance wise if you need to do complex datetime queries. Although on cubes there is the standard technique of storing, instead of an epoch, an int in YYYYMMDD format.
What's the problem with storing this data in a single table? An enterprise-level SQL server like Microsoft SQL 2005 can handle it without much pain.
By the way, do not do tables per year, tables per supplier or other things like this. If you have to store similar set of items, you need one and one only table. Setting multiple tables to store the same type of things will cause problems, like:
Queries would be extremely difficult to write, and performance will be decreased if you have to query from multiple tables.
The database design will be very difficult to understand (especially since it's not something natural to store the same type of items in different places).
You will not be able to easily modify your database (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.
It would require to automate a bunch of tasks. Let's see you have a table per year. If a new record is inserted on 2011-01-01 00:00:00.001, will a new table be created? Will you check at each insert if you must create a new table? How it would affect performance? Can you test it easily?
If there is a real, visible separation between "recent" and "old" data (for example you have to use daily the data saved the last month only, and you have to keep everything older, but you do not use it), you can build a system with two SQL servers (installed on different machines). The first, highly available server, will serve to handle recent data. The second, less available and optimized for writing, will store everything else. Then, on schedule, a program will move old data from the first one to the second.
With such a small tuple size (2 ints, 1 datetime, 1 decimal) I think you will be fine having a single table with all the results in it. SQL server 2005 does not limit the number of rows in a table.
If you go down this road and run in to performance problems, then it is time to look at alternatives. Until then, I would plow ahead.
EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is 21 bytes which means that you can store the entire table in less than 4 GB of memory. If you have a decent server(8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super fast queries after a slower warm-up time before the cache is populated.