SQL table to handle large amounts of records - sql

I need to make sure that a table of mine can handle in excess of 1,000,000 records.
Can I have some advice on my table code to determine whether it can indeed handle this many records?
Here is my code:
USE [db_person_cdtest]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [Person](
[PersonID] [numeric](18, 0) IDENTITY(1,1) NOT NULL,
[ID] [varchar](20),
[FirstName] [varchar](50) NOT NULL,
[LastName] [varchar](50) NOT NULL,
[AddressLine1] [varchar](50),
[AddressLine2] [varchar](50),
[AddressLine3] [varchar](50),
[MobilePhone] [varchar](20),
[HomePhone] [varchar](20),
[Description] [varchar](10),
[DateModified] [datetime],
[PersonCategory] [varchar](30) NOT NULL,
[Comment] [varchar](max),
CONSTRAINT [PK_Person] PRIMARY KEY CLUSTERED
(
[PersonID] DESC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY];

Almost any table structure in almost any database can handle a million records. That is not a large number of records for a modern computer running modern software.
Your structure looks reasonable. One question is whether the fields are always large enough to hold the values in the data. It looks like you are using SQL Server. There is no difference in storage or performance between declaring a varchar(50) and a varchar(8000). "50" seems on the low side to me.
Another comment is that you have a DateModified column. I would suggest that you also keep a history table of the modifications. It is often important to know what changed, when it changed, and what the values were before the change.
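A minimal sketch of one possible history table for the Person table above; the table and column names here are illustrative, and only a few of the tracked columns are repeated for brevity:
CREATE TABLE [PersonHistory](
[PersonHistoryID] [bigint] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[PersonID] [numeric](18, 0) NOT NULL, -- the row that changed
[FirstName] [varchar](50) NOT NULL, -- values as they were before the change
[LastName] [varchar](50) NOT NULL,
[PersonCategory] [varchar](30) NOT NULL,
[ChangedAt] [datetime] NOT NULL DEFAULT GETDATE(),
[ChangedBy] [varchar](255) NOT NULL DEFAULT SYSTEM_USER
);
A trigger or the application code would copy the old row into this table whenever Person is updated.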
In more advanced systems, you would not be storing a person's address and telephone number in the same table as their unique ids. A person could have more than one address (shipping address, billing address, home address, etc.). A person could have many telephone numbers (landline number, mobile number, work number, work mobile, etc.). And, you have no fields for email address, Facebook id, and so on. Contact information is more complex than a few fields in a table.
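As a rough sketch of what splitting the contact details out could look like (all table and column names here are illustrative, not part of the original schema):
CREATE TABLE [PersonAddress](
[PersonAddressID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[PersonID] [numeric](18, 0) NOT NULL REFERENCES [Person]([PersonID]),
[AddressType] [varchar](20) NOT NULL, -- 'Home', 'Billing', 'Shipping', ...
[AddressLine1] [varchar](50) NOT NULL,
[AddressLine2] [varchar](50) NULL
);
CREATE TABLE [PersonPhone](
[PersonPhoneID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[PersonID] [numeric](18, 0) NOT NULL REFERENCES [Person]([PersonID]),
[PhoneType] [varchar](20) NOT NULL, -- 'Mobile', 'Home', 'Work', ...
[PhoneNumber] [varchar](20) NOT NULL
);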
Finally, as a matter of habit, I almost always include the following fields at the end of every table:
CreatedBy varchar(255) default system_user,
CreatedAt datetime not null default getdate()
This lets me know who created a row and when.

Related

Store hourly data in an efficient way

There is a requirement to store hourly data in SQL Server 2016 and retrieve it. It's an OLTP database.
I will explain with an example: we need to capture the temperature of each city of a country and store it on an hourly basis. What would be the best and most efficient design to do this? The data would be stored for a year and then archived.
This is my plan. Can someone review it and let me know if this approach is fine?
CREATE TABLE [dbo].[CityMaster]
(
[CityId] [int] NULL,
[CityName] [varchar](300) NULL
) ON [PRIMARY]
--OPTION 1
CREATE TABLE [dbo].[WeatherData]
(
[Id] [bigint] NULL,
[CityId] [int] NULL,
[HrlyTemp] [decimal](18, 1) NULL,
[CapturedTIme] [datetime] NULL
) ON [PRIMARY]
GO
--OPTION2
CREATE TABLE [dbo].[WeatherData_JSon]
(
[Id] [bigint] NULL,
[CityId] [int] NULL,
[Month] [varchar](50) NULL,
[Hrlytemp] [nvarchar](max) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
To some extent, this depends on how the data is going to be used. The most natural solution is to tweak the first option and use a partitioned table:
CREATE TABLE [dbo].CityHourlyTemperatures (
Id bigint NULL,
CityId int NULL,
HrlyTemp decimal(6, 1) NULL,
CapturedTIme datetime NULL
) ;
Note I changed the table name to something that seems to better describe its contents.
Even with global warming, I think that 4 or 5 digits of precision in the temperature is quite sufficient -- 18 is way overkill.
Each row here has 8 + 4 + 5 + 8 bytes = 25 bytes (it may be rounded up if there are alignment restrictions). A year has about 8,766 hours. So, if you have 100 cities, this is less than a million rows per year and just a few tens of megabytes per year.
That is quite feasible, but you might want to consider partitioning the table -- the older partitions can act like an "archive".
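As one possible sketch of that partitioning, assuming yearly boundaries (the partition function, scheme and boundary dates here are illustrative):
CREATE PARTITION FUNCTION pfWeatherByYear (datetime)
AS RANGE RIGHT FOR VALUES ('2021-01-01', '2022-01-01', '2023-01-01');
CREATE PARTITION SCHEME psWeatherByYear
AS PARTITION pfWeatherByYear ALL TO ([PRIMARY]);
CREATE TABLE [dbo].CityHourlyTemperatures (
Id bigint NULL,
CityId int NULL,
HrlyTemp decimal(6, 1) NULL,
CapturedTIme datetime NULL
) ON psWeatherByYear (CapturedTIme);
Older partitions can then be switched out to a matching archive table once the year is over.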
Your second option stores the temperatures as a blob. This would make sense under only one circumstance: you don't care about the temperatures but you need to return the data to an application that does.
The implication from the name is that you want to store the value as JSON. This usually requires more overhead than storing the data using native types -- and is often less efficient. JSON is very powerful and useful for some types of data, particularly sparse data. However, your data is quite relational and can be stored in a relational format.
If you wanted to save space, you could consider the following:
Replacing the datetime value with an hourId column. This could possibly be a smallint if you only want a few years of data.
Removing the id column and defining the cityid/hourid as the primary key.
However, your volume of data does not seem to suggest that such an approach is necessary.
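For completeness, a minimal sketch of that space-saving variant (hourId and the table name are illustrative, not from the original schema):
CREATE TABLE [dbo].CityHourlyTemperaturesCompact (
CityId int NOT NULL,
HourId smallint NOT NULL, -- hours since an agreed epoch; a smallint covers roughly 3.7 years
HrlyTemp decimal(5, 1) NULL,
CONSTRAINT PK_CityHourlyTemperaturesCompact PRIMARY KEY (CityId, HourId)
);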
Option 2 is not feasible.
Option 1 is better.
I think the HrlyTemp column's size should be [decimal](4, 1) or [decimal](5, 1) at most.
If you deal with data for all the world's cities, based on an approximation of ~1,000 cities in the world,
then you need to store 365 * 24 * 1,000 = 8,760,000 ~ 9M rows per year. To stay on the safe side, we can assume that we have to store 10M rows.
Which is OK for SQL Server.
CREATE TABLE [dbo].[WeatherData]
(
[Id] [bigint] NULL,
[CityId] [int] NULL,
[HrlyTemp] [decimal](18, 1) NULL,
[CapturedTIme] [datetime] NULL,
[DailyTempRepo] [nvarchar](max) NULL -- one record created and updated per day
) ON [PRIMARY]
GO
* You can normalize your table further.
How to store data in the DailyTempRepo column as JSON:
[{"TempDate":"2021-03-06","CityId":"2","CapturedTIme":"09:30","HrlyTemp":"70"},
{"TempDate":"2021-03-06","CityId":"2","CapturedTIme":"10:30","HrlyTemp":"78"},
{"TempDate":"2021-03-06","CityId":"2","CapturedTIme":"11:30","HrlyTemp":"81"}]

Dealing with huge table - 100M+ rows

I have a table with around 100 million rows and it is only getting larger. As the table is queried pretty frequently, I have to come up with some solution to optimise this.
Firstly here is the model:
CREATE TABLE [dbo].[TreningExercises](
[TreningExerciseId] [uniqueidentifier] NOT NULL,
[NumberOfRepsForExercise] [int] NOT NULL,
[CycleNumber] [int] NOT NULL,
[TreningId] [uniqueidentifier] NOT NULL,
[ExerciseId] [int] NOT NULL,
[RoutineExerciseId] [uniqueidentifier] NULL)
Here is Trening table:
CREATE TABLE [dbo].[Trenings](
[TreningId] [uniqueidentifier] NOT NULL,
[DateTimeWhenTreningCreated] [datetime] NOT NULL,
[Score] [int] NOT NULL,
[NumberOfFinishedCycles] [int] NOT NULL,
[PercentageOfCompleteness] [int] NOT NULL,
[IsFake] [bit] NOT NULL,
[IsPrivate] [bit] NOT NULL,
[UserId] [nvarchar](128) NOT NULL,
[AllRoutinesId] [bigint] NOT NULL,
[Name] [nvarchar](max) NULL
)
Indexes (other than PK which are clustered):
TreningExercises:
TreningId (also FK)
ExerciseId (also FK)
Trenings:
UserId (also FK)
AllRoutinesId (also FK)
Score
DateTimeWhenTreningCreated (ordered by DateTimeWhenTreningCreated DESC)
And here is the example of the most commonly executed query:
DECLARE @userId VARCHAR(40)
,@exerciseId INT;
SELECT TOP (1) R.[TreningExerciseId] AS [TreningExerciseId]
,R.[NumberOfRepsForExercise] AS [NumberOfRepsForExercise]
,R.[TreningId] AS [TreningId]
,R.[ExerciseId] AS [ExerciseId]
,R.[RoutineExerciseId] AS [RoutineExerciseId]
,R.[DateTimeWhenTreningCreated] AS [DateTimeWhenTreningCreated]
FROM (
SELECT TE.[TreningExerciseId] AS [TreningExerciseId]
,TE.[NumberOfRepsForExercise] AS [NumberOfRepsForExercise]
,TE.[TreningId] AS [TreningId]
,TE.[ExerciseId] AS [ExerciseId]
,TE.[RoutineExerciseId] AS [RoutineExerciseId]
,T.[DateTimeWhenTreningCreated] AS [DateTimeWhenTreningCreated]
FROM [dbo].[TreningExercises] AS TE
INNER JOIN [dbo].[Trenings] AS T ON TE.[TreningId] = T.[TreningId]
WHERE (T.[UserId] = @userId)
AND (TE.[ExerciseId] = @exerciseId)
) AS R
ORDER BY R.[DateTimeWhenTreningCreated] DESC
Execution plan: link
Please accept my apologies if it is a bit unreadable or unoptimised; it was generated by an ORM (Entity Framework), I just edited it a bit.
According to Azure's SQL Analytics tool this query has the most impact on my DB, and even though it usually doesn't take too long to execute, from time to time there are spikes in DB I/O due to it.
Also there is a bit of business logic involved in this; to simplify it: 99% of the time I need data which is less than a year old.
What are my best options regarding querying and table size?
My thoughts on querying, either:
Create indexed view OR
Add Date and UserId fields to the TreningExercises table OR
Some option that I haven't thought of :)
Regarding table size, either:
Partition table (probably by date) OR
Move most of the data (or all of it) to some NoSQL key-value store OR
Some option that I haven't thought of :)
What are your thoughts about these problems, how should I approach solving them?
If you add the following columns to the index "ix_TreninID":
NumberOfRepsForExercise
ExerciseID
RoutineExerciseID
That will make the index a "covering index" and eliminate the need for the lookup which is taking 95% of the plan.
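A sketch of what that could look like, assuming ix_TreninID is the existing nonclustered index on TreningId (DROP_EXISTING rebuilds it in place with the extra included columns):
CREATE NONCLUSTERED INDEX ix_TreninID
ON [dbo].[TreningExercises] (TreningId)
INCLUDE (NumberOfRepsForExercise, ExerciseId, RoutineExerciseId)
WITH (DROP_EXISTING = ON);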
Give it a go, and post back.

I have a GUID Clustered primary key - Is there a way I can optimize or unfragment a table that might be fragmented?

Here's the code I have. The table actually has 20 more columns but I am just showing the first few:
CREATE TABLE [dbo].[Phrase]
(
[PhraseId] [uniqueidentifier] NOT NULL,
[PhraseNum] [int] NULL,
[English] [nvarchar](250) NOT NULL,
PRIMARY KEY CLUSTERED ([PhraseId] ASC)
) ON [PRIMARY]
GO
From what I remember, I read in
Fragmentation and GUID clustered key
that it was good to have a GUID for the primary key, but now it's been suggested it's not a good idea, as data has to be re-ordered for each insert -- causing fragmentation.
Can anyone comment on this? Now that my table has already been created, is there a way to defragment it? Also, how can I stop this problem from getting worse? Can I modify the existing table to add NEWSEQUENTIALID?
That's true, NEWSEQUENTIALID helps to completely fill the data and index pages.
But a NEWSEQUENTIALID value's data size is 4 times that of an int, so 4 times more pages will be required than with int.
declare @t table(col int
,col2 uniqueidentifier DEFAULT NEWSEQUENTIALID())
insert into @t (col) values(1),(2)
select DATALENGTH(col2),DATALENGTH(col) from @t
Suppose x data pages are required to hold 100 rows in the int case.
In the NEWSEQUENTIALID case, 4x data pages will be required to hold the same 100 rows.
Therefore a query will read more pages to fetch the same number of records.
So, if you can alter the table, you can add an int identity column and make it the primary key and clustered index. You can keep or drop the [uniqueidentifier] column as per your requirement.
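A rough sketch of that change, assuming the existing clustered primary key is named PK_Phrase (the real, system-generated name can be found in sys.key_constraints) and using PhraseIntId as an illustrative column name:
ALTER TABLE [dbo].[Phrase] ADD PhraseIntId int IDENTITY(1,1) NOT NULL;
ALTER TABLE [dbo].[Phrase] DROP CONSTRAINT PK_Phrase; -- drops the clustered GUID key
ALTER TABLE [dbo].[Phrase] ADD CONSTRAINT PK_Phrase_PhraseIntId PRIMARY KEY CLUSTERED (PhraseIntId);
-- Optionally keep the GUID unique for existing references:
CREATE UNIQUE NONCLUSTERED INDEX UX_Phrase_PhraseId ON [dbo].[Phrase] (PhraseId);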
Looks like this is a duplicate of:
INT vs Unique-Identifier for ID field in database
But here's a rehash for your issue:
Rather than a GUID, and depending on your table depth, int or bigint would be better choices, both from a storage and an optimization vantage. You might also consider defining the field as "int identity not null" to further help population.
GUIDs have a considerable storage impact, due to their length.
CREATE TABLE [dbo].[Phrase]
(
[PhraseId] [int] identity NOT NULL
CONSTRAINT [PK_Phrase_PhraseId] PRIMARY KEY,
[PhraseNum] [int] NULL,
[English] [nvarchar](250) NOT NULL,
....
) ON [PRIMARY]
GO

SQL Store varchar or int when I have only a set of values

I would like to know whether it is better (subjective, I know) to store an integer or a string when the field only has a set of possible values. E.g.
Person Table
1.
Name Age Category
Joe 25 0
Jane 28 2
John 22 1
2.
Name Age Category
Joe 25 Student
Jane 28 Teacher
John 22 Staff
Which method is advisable? Method 1 is probably faster and better for querying; however, there is more programming cost when displaying the data.
Method 2 is probably slower, but more expressive and with less programming cost.
Any advice will be useful.
Thanks in advance
You would generally do this using a reference table, with the category, and an integer for linking the tables.
A reference table has multiple advantages:
The list of possible values is available in one place. This is handy, for instance, for generating a list in an application.
There are no misspellings.
You can store additional information, such as a short name, a long description, honorific, etc.
If you need multi-lingual support, you have all the values in a single place.
The same values can be shared across multiple tables.
Sometimes, a reference table isn't appropriate. For instance, you might have just two values, ON and OFF. You can validate the values using a CHECK CONSTRAINT in most databases. That is a reasonable alternative. But I suspect that the category has more information than just a handful of values.
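A minimal sketch of the reference-table approach; the table and column names are illustrative, not from the original question:
CREATE TABLE [dbo].[Category](
[CategoryId] [int] NOT NULL PRIMARY KEY,
[CategoryName] [varchar](30) NOT NULL UNIQUE -- 'Student', 'Teacher', 'Staff'
);
CREATE TABLE [dbo].[Person](
[Name] [varchar](50) NOT NULL,
[Age] [int] NOT NULL,
[CategoryId] [int] NOT NULL
CONSTRAINT [FK_Person_Category] REFERENCES [dbo].[Category]([CategoryId])
);
-- The CHECK-constraint alternative for a tiny, fixed set of values (illustrative):
-- [Status] [varchar](3) NOT NULL CHECK ([Status] IN ('ON', 'OFF'))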
Maybe you are looking for a simple response, but I would like to share my method.
I have a table named CustomCaptions.
Here is the structure of the table:
CREATE TABLE [dbo].[CustomCaptions](
[Capt_ID] [int] IDENTITY(1,1) NOT NULL,
[Capt_Code] [varchar](50) NULL,
[Capt_Family] [varchar](50) NULL,
[Capt_FR] [nvarchar](100) NULL,
[Capt_EN] [nvarchar](100) NULL,
[Capt_ES] [nvarchar](100) NULL,
[Capt_IT] [nvarchar](100) NULL,
[Capt_TR] [nvarchar](100) NULL,
[Capt_CS] [nvarchar](100) NULL,
[Capt_DE] [nvarchar](100) NULL,
[Capt_Deleted] [bit] NULL,
[Capt_Order] [smallint] NULL,
CONSTRAINT [PK_CustomCaptions] PRIMARY KEY CLUSTERED
(
[Capt_ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The Capt_Family column holds the name of your foreign column (here, Category).
In your case, in the CustomCaptions table, I keep the code and the family.
For example,
Capt_Family Capt_Code Capt_EN ... Capt_Order
Category 0 Student 0
Category 1 Teacher 1
Category 2 Staff 2
This way, for all the small lists like category, status, type, etc., I use only one table, which reduces the total number of tables in my database.
Also, I have only one method to fill comboboxes or listboxes, by giving only the family name, and only one screen to edit the content of any list.
And, as you can see in the table structure, you can manage multiple languages easily in your application.
Depending on your needs, you can use Capt_EN or another column for another language.
Finally, if you wish, you can create views which will reduce the programming cost.
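For example, filling a combobox for one family is a single query (a sketch against the table above):
SELECT [Capt_Code], [Capt_EN]
FROM [dbo].[CustomCaptions]
WHERE [Capt_Family] = 'Category'
AND ISNULL([Capt_Deleted], 0) = 0
ORDER BY [Capt_Order];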
I hope this helps.

Insert only modified values and column names into a table

I have a SQL Server 2012 database in which I have a changeLog table that contains
TableName, ColumnName, FromValue and ToValue columns, which will be used to keep track of modified columns and data.
So if any update occurs through the application, only the modified columns should be inserted into this table with their new and old values.
Can anyone help me with this?
For Example:
If the procedure updates all columns of the property table (propertyName, address),
and the user updates propertyName (the update also contains the address column, but with no data change), then only propertyName and its data should be inserted into the ChangeLog table, not the address column and its data, because the address data does not contain any change.
If there is no other auditing requirement at all - you would not be thinking about auditing in any way without this - then OK, go for it. However, this is a very limited use of auditing: user X changed this field at time Y. Generally this is interesting as part of a wider question: what did user X do? How did that customer data in the database end up the way it is now?
Questions like that are harder to answer with the data structure you propose, and would be quite onerous to reconstruct. My usual approach would be as follows, starting from a base table like so (this is from one of my current projects):
CREATE TABLE [de].[Generation](
[Id] [int] IDENTITY(1,1) NOT NULL,
[LocalTime] [datetime] NOT NULL,
[EntityId] [int] NOT NULL,
[Generation] [decimal](18, 4) NOT NULL,
[UpdatedAt] [datetime] NOT NULL CONSTRAINT [DF_Generation_UpdatedAt] DEFAULT (getdate()),
CONSTRAINT [PK_Generation] PRIMARY KEY CLUSTERED
(
[Id] ASC
)
)
(I've excluded FK definitions as they aren't relevant here.)
First create an Audit table for this table:
CREATE TABLE [de].[GenerationAudit](
[AuditId] int identity(1, 1) not null,
[Id] [int] NOT NULL,
[LocalTimeOld] [datetime] NULL,
[EntityIdOld] [int] NULL,
[GenerationOld] [decimal](18, 4) null,
[UpdatedAtOld] [datetime] null,
[LocalTimeNew] [datetime] null,
[EntityIdNew] [int] null,
[GenerationNew] [decimal](18, 4) null,
[UpdatedAtNew] [datetime] NOT NULL CONSTRAINT [DF_GenerationAudit_UpdatedAt] DEFAULT (getdate()),
[UpdatedBy] varchar(60) not null,
CONSTRAINT [PK_GenerationAudit] PRIMARY KEY CLUSTERED
(
[AuditId] ASC
)
)
This table has an *Old and a *New version of each column that can change. The Id, being an IDENTITY PK, can't change, so there is no need for an old/new pair. I've also added an UpdatedBy column. It also has a new AuditId IDENTITY PK.
Next create three triggers on the base table: one for INSERT, one for UPDATE and one for DELETE. In the INSERT trigger, insert a row into the Audit table with the New columns selected from the inserted table and the Old values as null. In the UPDATE one, the Old values come from the deleted table and the New values from the inserted table. In the DELETE trigger, the Old values come from deleted and the New values are all null.
The UPDATE trigger would look like this:
CREATE TRIGGER GenerationAuditUpdate
ON de.Generation
AFTER UPDATE
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
insert into de.GenerationAudit (Id, LocalTimeOld, EntityIdOld, GenerationOld, UpdatedAtOld,
LocalTimeNew, EntityIdNew, GenerationNew, UpdatedAtNew,
UpdatedBy)
select isnull(i.Id, d.Id), d.LocalTime, d.EntityId, d.Generation, d.UpdatedAt,
i.LocalTime, i.EntityId, i.Generation, getdate(),
SYSTEM_USER
from inserted i
full outer join deleted d on d.Id = i.Id;
END
GO
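The INSERT and DELETE triggers follow the same pattern; as one example, a sketch of the DELETE trigger (the trigger name is illustrative):
CREATE TRIGGER GenerationAuditDelete
ON de.Generation
AFTER DELETE
AS
BEGIN
SET NOCOUNT ON;
insert into de.GenerationAudit (Id, LocalTimeOld, EntityIdOld, GenerationOld, UpdatedAtOld,
LocalTimeNew, EntityIdNew, GenerationNew, UpdatedAtNew,
UpdatedBy)
select d.Id, d.LocalTime, d.EntityId, d.Generation, d.UpdatedAt,
null, null, null, getdate(), -- the New values are all null for a delete
SYSTEM_USER
from deleted d;
END
GO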
You then have a full before/after picture of each change (and it'll be faster than separating out diffs column by column). You can create views over the Audit table to get entries where the Old value is different to the New, and include the base table Id (which you will also need in your structures!), the user who did it, and the time they did it (UpdatedAtNew).
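A minimal sketch of such a view, limited to the Generation column for brevity (the view name is illustrative):
CREATE VIEW de.GenerationAuditChanges
AS
SELECT AuditId, Id, UpdatedBy, UpdatedAtNew, GenerationOld, GenerationNew
FROM de.GenerationAudit
WHERE (GenerationOld <> GenerationNew)
OR (GenerationOld IS NULL AND GenerationNew IS NOT NULL)
OR (GenerationOld IS NOT NULL AND GenerationNew IS NULL);
GO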
That's my version of Auditing and it's mine!