SQL Data Normalisation / Performance

SQL Data Normalisation / Performance - sql

I am working on a web API for the insurance industry and trying to work out a suitable data structure for the quoting of insurance.
The database already contains a "ratings" table which is basically:
sysID (PK, INT IDENTITY)
goods_type (VARCHAR(16))
suminsured_min (DECIMAL(9,2))
suminsured_max (DECIMAL(9,2))
percent_premium (DECIMAL(9,6))
[Unique Index on goods_type, suminsured_min and suminsured_max]
[edit]
Each type of goods typically has 3 - 4 ranges for suminsured
[/edit]
The list of goods_types rarely changes and most queries for insurance will involve goods worth less than $100. Because of this, I was considering de-normalising using tables in the following format (for all values from $0.00 through to $100.00):
Table Name: tblRates[goodstype]
suminsured (DECIMAL(9,2)) Primary Key
premium (DECIMAL(9,2))
Denormalising this data should be easy to maintain as the rates are generally only updated once per month at most. All requests for values >$100 will always be looked up in the primary tables and calculated.
My question(s) are:
1. Am I better off storing the suminsured values as DECIMAL(9,2) or as a value in cents stored in a BIGINT?
2. This de-normalisation method involves storing 10,001 values ($0.00 to $100.00 in $0.01 increments) in possibly 20 tables. Is this likely to be more efficient than looking up the percent_premium and performing a calculation? - Or should I stick with the main tables and do the calculation?

Don't create new tables. You already have an index on goods, min and max values, so this sql for (known goods and its value):
SELECT percent_premium
FROM ratings
WHERE goods='PRECIOUST' and :PREC_VALUE BETWEEN suminsured_min AND suminsured_max
will use your index efficently.
The data type you are looking for is smallmoney. Use it.

The plan you suggest will use a binary search on 10001 rows instead of 3 or 4.
It's hardly a performance improvement, don't do that.
As for arithmetics, BIGINT will be slightly faster, thought I think you will hardly notice that.

i am not entirely sure exactly what calculations we are talking about, but unless they are obnoxiously complicated, they will more than likely be much quicker than looking up data in several different tables. if possible, perform the calculations in the db (i.e. use stored procedures) to minimize the data traffic between your application layers too.
and even if the data loading would be quicker, i think the idea of having to update de-normalized data as often as once a month (or even once a quarter) is pretty scary. you can probably do the job pretty quickly, but what about the next person handling the system? would you require of them to learn the db structure, remember which of the 20-some tables that need to be updated each time, and do it correctly? i would say the possible performance gain on de-normalizing will not be worth much to the risk of contaminating the data with incorrect information.

Related

SQL Server Time Series Modelling Huge datacollection

I have to implement data collection for replay for electrical parameters for 100-1000's of devices with at least 20 parameters to monitor. This amounts to huge data collection as it will be based very similar to time series.I have to support resolution for 1 second. thinking about 1 year [365*24*60*60*1000]=31536000000 rows.
I did my research but still have few questions
As data will be huge is it good to keep data in same table or should the tables be spitted. [data structure is same] or i should
rely on indexes?
Data inserts also will be very frequent but i can batch them still what is the best way? Is it directly writing to same database
or using a temporary database for write and sync with it?
Does SQL Server has a specific schema recommendation to do time series optimization for select,update and inserts? any out of box
helps for day average ? or specific common aggregate functions i can
write my own but just to know as this a standard problem so they
might have some best practices and samples out of box.**
please let me know any help is appreciated, thanks in advance

1) You probably want to explore the use of partitions. This will allow very effective inserts (its a meta operation if you do the partitioning correctly) and very fast (2). You may want to explore columnstore indexes because the data (once collected) will never change and you will have very large data sets. Partitioning and columnstore require a learning curve but its very doable. There are lots of code on the internet describing the use of date functions in SQL Server.

That is a big number but I would start with one table see if it hold up. If you split it in multiple tables it is still the same amount of data.
Do you ever need to search across devices? If not you can have a separate table for each device.
I have some audit tables that are not that big but still big and have not had any problems. If the data is loaded in time order then make date the first (or only) column of the clustered index.
If the the PK is date, device then fine but if you can get two reading in the same seconds you cannot do that. If this is the PK then if you can load the data by that sort. Even if you have to stage each second and load. You just cannot afford to fragment a table that big. If you cannot load by the sort then take a fill factor of 50%.
If you cannot have a PK then just use date as clustered index but not as PK and put a non clustered index on device.
I have some tables of 3,000,000,000 and I have the luxury of loading by PK with no other indexes. There is no measurable degradation in insert from row 1 to row 3,000,000,000.

audit table vs. Type 2 Slowly Changing Dimension

In SQL Server 2008+, we'd like to enable tracking of historical changes to a "Customers" table in an operational database.
It's a new table and our app controls all writing to the database, so we don't need evil hacks like triggers. Instead we will build the change tracking into our business object layer, but we need to figure out the right database schema to use.
The number of rows will be under 100,000 and number of changes per record will average 1.5 per year.
There are at least two ways we've been looking at modelling this:
As a Type 2 Slowly Changing Dimension table called CustomersHistory, with columns for EffectiveStartDate, EffectiveEndDate (set to NULL for the current version of the customer), and auditing columns like ChangeReason and ChangedByUsername. Then we'd build a Customers view over that table which is filtered to EffectiveEndDate=NULL. Most parts of our app would query using that view, and only parts that need to be history-aware would query the underlying table. For performance, we could materialize the view and/or add a filtered index on EffectiveEndDate=NULL.
With a separate audit table. Every change to a Customer record writes once to the Customer table and again to a CustomerHistory audit table.
From a quick review of StackOverflow questions, #2 seems to be much more popular. But is this because most DB apps have to deal with legacy and rogue writers?
Given that we're starting from a blank slate, what are pros and cons of either approach? Which would you recommend?

In general, the issue with SCD Type- II is, if the average number of changes in the values of the attributes are very high, you end-up having a very fat dimension table. This growing dimension table joined with a huge fact table slows down the query performance gradually. It's like slow-poisoning.. Initially you don't see the impact. When you realize it, it's too late!
Now I understand that you will create a separate materialized view with EffectiveEndDate = NULL and that will be used in most of your joins. Additionally for you, the data volume is comparatively low (100,000). With average changes of only 1.5 per year, I don't think data volume / query performance etc. are going to be your problem in the near future.
In other words, your table is truly a slowly changing dimension (as opposed to a rapidly changing dimension - where your option #2 is a better fit). In your case, I will prefer option #1.

Join or storing directly

I have a table A which contains entries I am regularly processing and storing the result in table B. Now I want to determine for each entry in A its latest processing date in B.
My current implementation is joining both tables and retrieving the latest date. However an alternative, maybe less flexible, approach would be to simply store the date in table A directly.
I can think of pros and cons for both cases (performance, scalability, ....), but didnt have such a case yet and would like to see whether someone here on stackoverflow had a similar situation and has a recommendation for either one for a specific reason.
Below a quick schema design.
Table A
id, some-data, [possibly-here-last-process-date]
Table B
fk-for-A, data, date
Thanks

Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.
I would leave Table A alone and just introduce an index on id and date. If the historical table is big, introduce an auto-increment PK for table B and have a separate table that maps the B-Pkid to A-pkid.
I'm not a fan of UPDATE on a warehouse table, that's why I didn't recommend a CURRENT_IND, but that's an alternative.

This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).
You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.
On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticable performance impact unless you have HUGE amounts of data.
So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you have on the production environment. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, de-normalize the table.
Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on a moderately complex system, denormalized data schemas are at the heart of a lot of stupid bugs. It makes introducing new developers harder, it means additional complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect lots of data.

We had a similar situation in our project tracking system where the latest state of the project is stored in the projects table (Cols: project_id, description etc.,) and the history of the project is stored in the project_history table (Cols: project_id, update_id, description etc.,). Whenever there is a new update to the project, we need find out the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and get the MAX(update_id), but the cost would be high considering the number of the project updates (in a couple of hundreds of thousands) and the frequency of update. So, we decided to store the value in the projects table itself in max_update_id column and keep updating it whenever there is a new update to a given project. HTH.

If I understand correctly, you have a table whose each row is a parameter and another table that logs each parameter value historically in a time series. If that is correct, I currently have the same situation in one of the products I am building. My parameter table hosts a listing of measures (29K recs) and the historical parameter value table has the value for that parameter every 1 hr - so that table currently has 4M rows. At any given point in time there will be a lot more requests FOR THE LATEST VALUE than for the history so I DO HAVE THE LATEST VALUE STORED IN THE PARAMETER TABLE in addition to it being in the last record in the parameter value table. While this may look like duplication of data, from the performance standpoint it makes perfect sense because
To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join and more importantly
I do not have to get the latest value for each parameter from such a huge table
So yes, I would in your case most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writing new data but a hell of a lot faster for reads.

Handling 100's of 1,000,000's of rows in T-SQL2005

I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the to older DBs to be stored in one table. This table has only a primary key, a foreign key (both int's), a datetime and a decimal field, but adding the count of rows from the two older DBs indicates that the total row count for this new table would be about 200,000,000 rows.
How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.
I've got ideas based around having multiple tables for year, supplier (of the source data) etc - or even having one database for each year, with the most recent 2 years in one DB (which would also contain the stored procs for managing all this.)
Any and all help, ideas, suggestions very, deeply, much appreciated,
Matt.

Most importantly. consider profiling your queries and measuring where your actual bottlenecks are (try identifying the missing indexes), you might see that you can store everything in a single table, or that buying a few extra hard disks will be enough to get sufficient performance.
Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition with the 1% commonly accessed and another with the 99% of the data.
This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.
On the other hand, it might make more sense to actually splitting the tables in 'current' and 'historical'.
Another possible size improvement is using an int (like an epoch) instead of a datetime and provide functions to convert from datetime to int, thus having queries like
SELECT * FROM megaTable WHERE datetime > dateTimeToEpoch('2010-01-23')
This size savings will probably have a cost performance wise if you need to do complex datetime queries. Although on cubes there is the standard technique of storing, instead of an epoch, an int in YYYYMMDD format.

What's the problem with storing this data in a single table? An enterprise-level SQL server like Microsoft SQL 2005 can handle it without much pain.
By the way, do not do tables per year, tables per supplier or other things like this. If you have to store similar set of items, you need one and one only table. Setting multiple tables to store the same type of things will cause problems, like:
Queries would be extremely difficult to write, and performance will be decreased if you have to query from multiple tables.
The database design will be very difficult to understand (especially since it's not something natural to store the same type of items in different places).
You will not be able to easily modify your database (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.
It would require to automate a bunch of tasks. Let's see you have a table per year. If a new record is inserted on 2011-01-01 00:00:00.001, will a new table be created? Will you check at each insert if you must create a new table? How it would affect performance? Can you test it easily?
If there is a real, visible separation between "recent" and "old" data (for example you have to use daily the data saved the last month only, and you have to keep everything older, but you do not use it), you can build a system with two SQL servers (installed on different machines). The first, highly available server, will serve to handle recent data. The second, less available and optimized for writing, will store everything else. Then, on schedule, a program will move old data from the first one to the second.

With such a small tuple size (2 ints, 1 datetime, 1 decimal) I think you will be fine having a single table with all the results in it. SQL server 2005 does not limit the number of rows in a table.
If you go down this road and run in to performance problems, then it is time to look at alternatives. Until then, I would plow ahead.
EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is 21 bytes which means that you can store the entire table in less than 4 GB of memory. If you have a decent server(8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super fast queries after a slower warm-up time before the cache is populated.

C# with SQL database schema question

Is it acceptable to dynamically generate the total of the contents of a field using up to 10k records instead of storing the total in a table?
I have some reasons to prefer on-demand generation of a total, but how bad is the performance price on an average home PC? (There would be some joins -ORM managed- involved in figuring the total.)
Let me know if I'm leaving out any info important to deciding the answer.
EDIT: This is a stand-alone program on a user's PC.

If you have appropriate indexing in place, it won't be too bad to do on demand calculations. The reason that I mention indexing is that you haven't specified whether the total is on all the values in a column, or on a subset - if it's a subset, then the fields that make up the filter may need to be indexed, so as to avoid table scans.

Usually it is totally acceptable and even recommended to recalculate values. If you start storing calculated values, you'll face some overhead ensuring that they are always up to date, usually using triggers.
That said, if your specific calculation query turns out to take a lot of time, you might need to go that route, but only do that if you actually hit a performance problem, not upfront.

Using a Sql query you can quickly and inexpensively get the total number of records using the max function.
It is better to generate the total then keep it as a record, the same way as you would keep a persons birth date and determine their age then keep their age.

How offten and by what number of users u must get this total value, how offten data on which total depends are updated.
Maybe only thing you need is to make this big query once a day (or once at all) and save it somewhere in db and then update it when data, on which your total consist, are changed

You "could" calculate the total with SQL (I am assuming you do not want total number of records ... the price total or whatever it is). SQL is quite good at mathematics when it gets told to do so :) No storing of total.
But, as it is all run on the client machine, I think my preference would be to total using C#. Then the business rules for calculating the total are out of the DB/SQL. By that I mean if you had a complex calculation for total that reuired adding say 5% to orders below £50 and the "business" changed it to add 10% to orders below £50 it is done in your "business logic" code rather than in your storage medium (in this case SQL).
Kindness,
Dan

I think that it should not take long, probably less than a second, to generate a sum from 8000-10000 records. Even on a single PC the query plan for this query should be dominated by a single table scan, which will generate mostly sequential I/O.
Proper indexing should make any joins reasonably efficient unless the schema is deeply flawed and unless you have (say) large blob fields in the table the total data volume for the rows should not be very large at all. If you still have performance issues going through an O/R mapper, consider re-casting the functionality as a report where you can control the SQL.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas