Dealing with huge table - 100M+ rows

I have a table with around 100 million rows, and it is only getting larger. Because the table is queried frequently, I need to find a way to optimise it.
Firstly here is the model:
CREATE TABLE [dbo].[TreningExercises](
[TreningExerciseId] [uniqueidentifier] NOT NULL,
[NumberOfRepsForExercise] [int] NOT NULL,
[CycleNumber] [int] NOT NULL,
[TreningId] [uniqueidentifier] NOT NULL,
[ExerciseId] [int] NOT NULL,
[RoutineExerciseId] [uniqueidentifier] NULL)
Here is the Trenings table:
CREATE TABLE [dbo].[Trenings](
[TreningId] [uniqueidentifier] NOT NULL,
[DateTimeWhenTreningCreated] [datetime] NOT NULL,
[Score] [int] NOT NULL,
[NumberOfFinishedCycles] [int] NOT NULL,
[PercentageOfCompleteness] [int] NOT NULL,
[IsFake] [bit] NOT NULL,
[IsPrivate] [bit] NOT NULL,
[UserId] [nvarchar](128) NOT NULL,
[AllRoutinesId] [bigint] NOT NULL,
[Name] [nvarchar](max) NULL,
)
Indexes (other than PK which are clustered):
TreningExercises:
TreningId (also FK)
ExerciseId (also FK)
Trenings:
UserId (also FK)
AllRoutinesId (also FK)
Score
DateTimeWhenTreningCreated (ordered by DateTimeWhenTreningCreated DESC)
And here is the example of the most commonly executed query:
DECLARE @userId VARCHAR(40)
,@exerciseId INT;
SELECT TOP (1) R.[TreningExerciseId] AS [TreningExerciseId]
,R.[NumberOfRepsForExercise] AS [NumberOfRepsForExercise]
,R.[TreningId] AS [TreningId]
,R.[ExerciseId] AS [ExerciseId]
,R.[RoutineExerciseId] AS [RoutineExerciseId]
,R.[DateTimeWhenTreningCreated] AS [DateTimeWhenTreningCreated]
FROM (
SELECT TE.[TreningExerciseId] AS [TreningExerciseId]
,TE.[NumberOfRepsForExercise] AS [NumberOfRepsForExercise]
,TE.[TreningId] AS [TreningId]
,TE.[ExerciseId] AS [ExerciseId]
,TE.[RoutineExerciseId] AS [RoutineExerciseId]
,T.[DateTimeWhenTreningCreated] AS [DateTimeWhenTreningCreated]
FROM [dbo].[TreningExercises] AS TE
INNER JOIN [dbo].[Trenings] AS T ON TE.[TreningId] = T.[TreningId]
WHERE (T.[UserId] = @userId)
AND (TE.[ExerciseId] = @exerciseId)
) AS R
ORDER BY R.[DateTimeWhenTreningCreated] DESC
Execution plan: link
Please accept my apologies if it is a bit unreadable or unoptimised; it was generated by an ORM (Entity Framework), and I only edited it slightly.
According to Azure's SQL Analytics tool, this query has the most impact on my DB. Although it usually doesn't take long to execute, it occasionally causes spikes in DB I/O.
There is also some business logic involved. To simplify: 99% of the time I need data that is less than a year old.
What are my best options regarding querying and table size?
My thoughts on querying, either:
Create indexed view OR
Add Date and UserId fields to the TreningExercises table OR
Some option that I haven't thought of :)
Regarding table size, either:
Partition table (probably by date) OR
Move most of the data (or all of it) to some NoSQL key-value store OR
Some option that I haven't thought of :)
What are your thoughts about these problems, how should I approach solving them?

If you add the following columns to the index "ix_TreninID":
NumberOfRepsForExercise
ExerciseId
RoutineExerciseId
that will make it a "covering index" for this query and eliminate the need for the lookup, which accounts for 95% of the plan.
Give it a go, and post back.
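As a rough sketch of that suggestion, the extra columns could be added as included columns on an index keyed on TreningId. The index name below is hypothetical, and the statement creates a new index rather than altering the existing one, whose exact definition is not shown in the question. The clustered key (TreningExerciseId) is carried in every nonclustered index automatically, so it does not need to be listed.

```sql
-- Hypothetical index name; the existing index on TreningId is left untouched.
-- Key column: TreningId (used by the join to Trenings).
-- Included columns: the remaining TreningExercises columns the query selects or filters on.
CREATE NONCLUSTERED INDEX IX_TreningExercises_TreningId_Covering
ON dbo.TreningExercises (TreningId)
INCLUDE (NumberOfRepsForExercise, ExerciseId, RoutineExerciseId);
```

On a table of this size the build takes time and extra disk space, so it is worth comparing the execution plan before and after, and dropping the old single-column index if the new one makes it redundant.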

Related

Store hourly data efficient way

There is a requirement to store hourly data in SQL Server 2016 and retrieve it. It's an OLTP database.
I will explain with an example: we need to capture the temperature of each city in a country and store it on an hourly basis. What would be the best and most efficient design for this? The data would be stored for a year and then archived.
This is my plan. Can someone review it and let me know if this approach is fine?
CREATE TABLE [dbo].[CityMaster]
(
[CityId] [int] NULL,
[CityName] [varchar](300) NULL
) ON [PRIMARY]
--OPTION 1
CREATE TABLE [dbo].[WeatherData]
(
[Id] [bigint] NULL,
[CityId] [int] NULL,
[HrlyTemp] [decimal](18, 1) NULL,
[CapturedTIme] [datetime] NULL
) ON [PRIMARY]
GO
--OPTION2
CREATE TABLE [dbo].[WeatherData_JSon]
(
[Id] [bigint] NULL,
[CityId] [int] NULL,
[Month] [varchar](50) NULL,
[Hrlytemp] [nvarchar](max) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

To some extent, this depends on how the data is going to be used. The most natural solution is to tweak the first option and use a partitioned table:
CREATE TABLE [dbo].CityHourlyTemperatures (
Id bigint NULL,
CityId int NULL,
HrlyTemp decimal(6, 1) NULL,
CapturedTIme datetime NULL
) ;
Note that I changed the name to something that better describes the contents.
Even with global warming, I think that 4 or 5 digits of precision in the temperature is quite sufficient -- 18 is way overkill.
Each row here has 8 + 4 + 5 + 8 bytes = 25 bytes (it may be rounded up if there are alignment restrictions). A year has about 8,766 hours. So, if you have 100 cities, this is less than a million rows per year and just a few tens of megabytes per year.
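The estimate above can be checked with a quick calculation. This is only a sketch of the arithmetic: it uses the 25 raw bytes per row and 100 cities from the text, and ignores per-row and per-page overhead, so the real size will be somewhat larger.

```sql
-- 8,766 hours per year x 100 cities x 25 bytes per row.
SELECT
    8766 * 100                  AS RowsPerYear,            -- 876,600 rows
    8766 * 100 * 25 / 1048576.0 AS ApproxMegabytesPerYear; -- roughly 21 MB
```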
That is quite feasible, but you might want to consider partitioning the table -- the older partitions can act like an "archive".
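A minimal sketch of what yearly partitioning could look like follows. The partition function name, scheme name, boundary dates, and primary key are all assumptions made for illustration; the table otherwise follows the definition above, with the key columns made NOT NULL so they can form the primary key.

```sql
-- Hypothetical names and boundary dates: one partition per calendar year.
CREATE PARTITION FUNCTION pfCapturedYear (datetime)
AS RANGE RIGHT FOR VALUES ('20210101', '20220101');
GO
-- For simplicity, every partition is mapped to the PRIMARY filegroup.
CREATE PARTITION SCHEME psCapturedYear
AS PARTITION pfCapturedYear ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.CityHourlyTemperatures (
    Id           bigint        NULL,
    CityId       int           NOT NULL,
    HrlyTemp     decimal(6, 1) NULL,
    CapturedTIme datetime      NOT NULL,
    -- The partitioning column must be part of the clustered primary key.
    CONSTRAINT PK_CityHourlyTemperatures
        PRIMARY KEY CLUSTERED (CityId, CapturedTIme)
) ON psCapturedYear (CapturedTIme);
GO
```

With this layout, a year that has aged out can be switched out or truncated as a whole partition instead of being deleted row by row.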
Your second option stores the temperatures as a blob. This would make sense under only one circumstance: you don't care about the temperatures but you need to return the data to an application that does.
The implication from the name is that you want to store the value as JSON. This usually requires more overhead than storing the data using native types -- and is often less efficient. JSON is very powerful and useful for some types of data, particularly sparse data. However, your data is quite relational and can be stored in a relational format.
If you wanted to save space, you could consider the following:
Replacing the datetime value with an hourId column. This could possibly be a smallint if you only want a few years of data.
Removing the id column and defining the cityid/hourid as the primary key.
However, your volume of data does not seem to suggest that such an approach is necessary.
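For illustration only, the space-saving variant described above might look like the sketch below. The table name, the epoch date, and the sample timestamp are assumptions; a positive smallint covers 32,767 hours, which is about 3.7 years from the chosen epoch.

```sql
-- Hypothetical compact layout: HourId counts hours since an assumed epoch,
-- and (CityId, HourId) replaces the surrogate Id as the primary key.
CREATE TABLE dbo.CityHourlyTemperaturesCompact (
    CityId   int           NOT NULL,
    HourId   smallint      NOT NULL,
    HrlyTemp decimal(6, 1) NULL,
    CONSTRAINT PK_CityHourlyTemperaturesCompact
        PRIMARY KEY CLUSTERED (CityId, HourId)
);

-- Converting a timestamp to an HourId, with 2021-01-01 as the assumed epoch.
DECLARE @epoch datetime = '20210101';
DECLARE @captured datetime = '20210306 09:00';
SELECT CAST(DATEDIFF(HOUR, @epoch, @captured) AS smallint) AS HourId;
```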

Option 2 is not feasible.
Option 1 is better.
I think the HrlyTemp column should be [decimal](4, 1) or [decimal](5, 1) at most.
If you deal with data for cities worldwide, assume roughly 1,000 cities as an approximation.
You then need to store 365*24*1000 = 8,760,000, or about 9M rows per year. To stay on the safe side, assume 10M rows per year,
which is fine for SQL Server.

CREATE TABLE [dbo].[WeatherData]
(
[Id] [bigint] NULL,
[CityId] [int] NULL,
[HrlyTemp] [decimal](18, 1) NULL,
[CapturedTIme] [datetime] NULL,
[DailyTempRepo] nvarchar(max) -- one record per day, created once and updated during that day
) ON [PRIMARY]
GO
You can normalize your table further.
How to store data in the DailyTempRepo column as JSON:
[{"TempDate":"2021-03-06","CityId":"2","CapturedTIme":"09:30","HrlyTemp":"70"},
{"TempDate":"2021-03-06","CityId":"2","CapturedTIme":"10:30","HrlyTemp":"78"},
{"TempDate":"2021-03-06","CityId":"2","CapturedTIme":"11:30","HrlyTemp":"81"}]

How to deal with large amount of XML data in a SQL Server database

A table has 10 columns, and 2 of them store huge amounts of data. One column (XML data type) stores XML data; another (NVARCHAR(MAX)) stores JSON data. Each row is around 1.4 MB. Moreover, a SELECT takes a long time (22 seconds) to load just one record.
TABLE - [dbo].[CampaignConfiguration](
[CampaignId] [uniqueidentifier] NOT NULL,
[ConfigurationXml] [xml] NOT NULL,
[ConfigurationRules] [nvarchar](max) NOT NULL,
[RulesVersion] [nvarchar](50) NULL,
[Created] [datetime] NOT NULL,
[CampaignCalculationParameters] [nvarchar](max) NULL
Select *
From CampaignConfiguration
Where campaignid = 'EB5C2CDB-C076-5174-61D1-D9EA0E04975A';
Is there a better way to improve SELECT query performance?

SQL Table handle large amounts of records

I need to make sure that a table of mine can handle in excess of 1,000,000 records.
Can I have some advice on my table code to determine whether it can indeed handle this number of records?
Here is my code:
USE [db_person_cdtest]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [Person](
[PersonID] [numeric](18, 0) IDENTITY(1,1) NOT NULL,
[ID] [varchar](20),
[FirstName] [varchar](50) NOT NULL,
[LastName] [varchar](50) NOT NULL,
[AddressLine1] [varchar](50),
[AddressLine2] [varchar](50),
[AddressLine3] [varchar](50),
[MobilePhone] [varchar](20),
[HomePhone] [varchar](20),
[Description] [varchar](10),
[DateModified] [datetime],
[PersonCategory] [varchar](30) NOT NULL,
[Comment] [varchar](max),
CONSTRAINT [PK_Person] PRIMARY KEY CLUSTERED
(
[PersonID] DESC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY];

Almost any table structure in almost any database can handle a million records. That is not a large number of records for a modern computer running modern software.
Your structure looks reasonable. One question is whether the columns are always large enough to hold the values in the data. It looks like you are using SQL Server. There is no difference in storage between declaring a varchar(50) and a varchar(8000), because only the actual data length is stored, although the declared length can affect memory grant estimates. "50" seems on the low side to me.
Another comment is that you have a DateModified column. I would suggest that you also keep a history table of the modifications. It is often important to know what changed, when it changed, and what the values were before the change.
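One possible shape for such a history table is sketched below, using a trigger that copies the old values on every update. The table name, trigger name, and the choice of tracked columns are assumptions for illustration, and the Person table is assumed to live in the dbo schema.

```sql
-- Hypothetical history table: one row per change, holding the values before the update.
CREATE TABLE dbo.PersonHistory (
    PersonHistoryID bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    PersonID        numeric(18, 0) NOT NULL,
    FirstName       varchar(50)    NOT NULL,
    LastName        varchar(50)    NOT NULL,
    MobilePhone     varchar(20)    NULL,
    ChangedAt       datetime       NOT NULL DEFAULT GETDATE(),
    ChangedBy       varchar(255)   NOT NULL DEFAULT SYSTEM_USER
);
GO
CREATE TRIGGER dbo.trg_Person_History
ON dbo.Person
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- "deleted" holds the row values as they were before the UPDATE.
    INSERT INTO dbo.PersonHistory (PersonID, FirstName, LastName, MobilePhone)
    SELECT d.PersonID, d.FirstName, d.LastName, d.MobilePhone
    FROM deleted AS d;
END;
GO
```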
In more advanced systems, you would not be storing a person's address and telephone number in the same table as their unique ids. A person could have more than one address (shipping address, billing address, home address, etc.). A person could have many telephone numbers (landline number, mobile number, work number, work mobile, etc.). And, you have no fields for email address, Facebook id, and so on. Contact information is more complex than a few fields in a table.
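As a sketch of that idea, contact details could move into child tables with one row per address or phone number. The table names, column names, and type values below are hypothetical, and the Person table is assumed to live in the dbo schema.

```sql
-- Hypothetical child tables: a person can have any number of addresses and phones.
CREATE TABLE dbo.PersonAddress (
    PersonAddressID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    PersonID        numeric(18, 0) NOT NULL
        REFERENCES dbo.Person (PersonID),
    AddressType     varchar(20) NOT NULL,  -- e.g. 'home', 'billing', 'shipping'
    AddressLine1    varchar(50) NOT NULL,
    AddressLine2    varchar(50) NULL,
    AddressLine3    varchar(50) NULL
);

CREATE TABLE dbo.PersonPhone (
    PersonPhoneID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    PersonID      numeric(18, 0) NOT NULL
        REFERENCES dbo.Person (PersonID),
    PhoneType     varchar(20) NOT NULL,    -- e.g. 'mobile', 'landline', 'work'
    PhoneNumber   varchar(20) NOT NULL
);
```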
Finally, as a matter of habit, I almost always include the following fields at the end of every table:
CreatedBy varchar(255) default system_user,
CreatedAt datetime not null default getdate()
This lets me know who created a row and when.

How to update 2nd table with identity value of inserted rows into 1st table

I have the following table structures
CREATE TABLE [dbo].[WorkItem](
[WorkItemId] [int] IDENTITY(1,1) NOT NULL,
[WorkItemTypeId] [int] NOT NULL,
[ActionDate] [datetime] NOT NULL,
[WorkItemStatusId] [int] NOT NULL,
[ClosedDate] [datetime] NULL,
)
CREATE TABLE [dbo].[RequirementWorkItem](
[WorkItemId] [int] NOT NULL,
[RequirementId] [int] NOT NULL,
)
CREATE TABLE #RequirmentWorkItems
(
RequirementId int,
WorkItemTypeId int,
WorkItemStatusId int,
ActionDate datetime
)
I use the #RequirmentWorkItems table to create workitems for requirements. I then need to INSERT the workitems into the WorkItem table and use the identity values from the WorkItem table to create the cross-reference rows in the RequirementWorkItem table.
Is there a way to do this without cursoring through each row? I can't put the RequirementId into the WorkItem table because, depending on the WorkItemTypeId, the WorkItem could be linked to a Requirement, a Notice, or an Event.
So there are really 3 xref tables for WorkItems. Or would it be better to put a RequirementId, NoticeId and EventId in the WorkItem table and 1 of the columns would have a value and other 2 would be null? Hopefully all this makes sense. Thanks.

You should read MERGE and OUTPUT – the swiss army knife of T-SQL for more information about this.
Today I stumbled upon a different use for it, returning values using an OUTPUT clause from a table used as the source of data for an insertion. In other words, if I’m inserting from [tableA] into [tableB] then I may want some values from [tableA] after the fact, particularly if [tableB] has an identity. Observe my first attempt using a straight insertion where I am trying to get a field from #source.[id] that is not used in the insertion:
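The code from the quoted article is not reproduced here. Applied to the tables in the question, the technique might look like the following sketch: a MERGE whose join condition never matches inserts every source row, and its OUTPUT clause can return the new identity value alongside a source column that was not inserted. The table variable @Map is a hypothetical helper, and rows in #RequirmentWorkItems are assumed to have a non-null RequirementId.

```sql
-- Hypothetical helper that pairs each new WorkItemId with its RequirementId.
DECLARE @Map TABLE (
    WorkItemId    int NOT NULL,
    RequirementId int NOT NULL
);

MERGE INTO dbo.WorkItem AS tgt
USING #RequirmentWorkItems AS src
    ON 1 = 0  -- never matches, so every source row is inserted
WHEN NOT MATCHED BY TARGET THEN
    INSERT (WorkItemTypeId, ActionDate, WorkItemStatusId)
    VALUES (src.WorkItemTypeId, src.ActionDate, src.WorkItemStatusId)
-- Unlike a plain INSERT, MERGE lets OUTPUT reference source columns.
OUTPUT inserted.WorkItemId, src.RequirementId
INTO @Map (WorkItemId, RequirementId);

-- Create the cross-reference rows from the captured pairs.
INSERT INTO dbo.RequirementWorkItem (WorkItemId, RequirementId)
SELECT WorkItemId, RequirementId
FROM @Map;
```

Writing to a table variable first, rather than using OUTPUT INTO on RequirementWorkItem directly, avoids the restrictions that OUTPUT INTO places on targets with foreign keys or triggers.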

Creating trigger in SQL Server 2005 (has to work in 2008 too) to prevent duplicates?

I have a table into which I insert data with the following query (from C# code):
INSERT INTO [BazaZarzadzanie].[dbo].[Wycena]
([KlienciPortfeleKontaID]
,[WycenaData]
,[WycenaTyp]
,[WycenaWartosc]
,[WycenaWaluta]
,[WycenaUzytkownik]
,[WycenaUzytkownikData])
VALUES
(@varKlienciPortfeleKontaID
,@varWycenaData
,@varWycenaTyp
,@varWycenaWartosc
,@varWycenaWaluta
,@varWycenaUzytkownik
,@varWycenaUzytkownikData)
Table creation script looks like this:
CREATE TABLE [dbo].[Wycena](
[KlienciPortfeleKontaID] [int] NULL,
[WycenaData] [datetime] NULL,
[WycenaTyp] [int] NULL,
[InID] [int] NULL,
[WycenaIlosc] [decimal](18, 2) NULL,
[WycenaCena] [decimal](18, 2) NULL,
[WycenaWartosc] [decimal](18, 2) NULL,
[WycenaWaluta] [nvarchar](3) NULL,
[WycenaUzytkownik] [nvarchar](50) NULL,
[WycenaUzytkownikData] [datetime] NULL
) ON [PRIMARY]
It also has a couple of foreign keys, but nothing that I could make a primary or unique key. So, to prevent duplicates, I thought I would use a trigger, since to know that a row is a duplicate I have to test every value in that row (well, maybe not the last 2 columns). This table has around 2 million rows.
Is this a good idea? Or is there a better way?
Below is the trigger I've created (not tested):
CREATE TRIGGER [dbo].[trg_WycenaDuplicateCheck]
ON [dbo].[Wycena] FOR INSERT
AS
IF EXISTS(SELECT INSERTED.[KlienciPortfeleKontaID]
,INSERTED.[WycenaData]
,INSERTED.[WycenaTyp]
,INSERTED.[InID]
,INSERTED.[WycenaIlosc]
,INSERTED.[WycenaCena]
,INSERTED.[WycenaWartosc]
,INSERTED.[WycenaWaluta]
FROM INSERTED, Wycena
WHERE INSERTED.[KlienciPortfeleKontaID] = Wycena.[KlienciPortfeleKontaID]
AND INSERTED.[WycenaData] = Wycena.[WycenaData]
AND INSERTED.[WycenaTyp] = Wycena.[WycenaTyp]
AND INSERTED.[InID] = Wycena.[InID]
AND INSERTED.[WycenaIlosc] = Wycena.[WycenaIlosc]
AND INSERTED.[WycenaCena] = Wycena.[WycenaCena]
AND INSERTED.[WycenaWartosc] = Wycena.[WycenaWartosc]
AND INSERTED.[WycenaWaluta] = Wycena.[WycenaWaluta]
Group By INSERTED.[KlienciPortfeleKontaID]
,INSERTED.[WycenaData]
,INSERTED.[WycenaTyp]
,INSERTED.[InID]
,INSERTED.[WycenaIlosc]
,INSERTED.[WycenaCena]
,INSERTED.[WycenaWartosc]
,INSERTED.[WycenaWaluta]
HAVING COUNT (*) > 1)
BEGIN
RAISERROR('>>>DUPLICATES PREVENTED<<< ',10,1)
ROLLBACK TRAN
END

Create a "unique" index on the fields you care about.
CREATE UNIQUE INDEX IX_YOUR_FAVORITE_NAME
ON [dbo].[Wycena](... list of columns goes here ...)
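Filled in with the columns the trigger in the question compares, this could look like the sketch below. The index name is hypothetical. Two points to keep in mind: the index cannot be created while duplicate rows already exist, and SQL Server treats NULLs as equal in a unique index, so two rows that match on every listed column, NULLs included, count as duplicates.

```sql
-- Step 1: list existing duplicates; these must be cleaned up first.
SELECT KlienciPortfeleKontaID, WycenaData, WycenaTyp, InID,
       WycenaIlosc, WycenaCena, WycenaWartosc, WycenaWaluta,
       COUNT(*) AS Copies
FROM dbo.Wycena
GROUP BY KlienciPortfeleKontaID, WycenaData, WycenaTyp, InID,
         WycenaIlosc, WycenaCena, WycenaWartosc, WycenaWaluta
HAVING COUNT(*) > 1;

-- Step 2: hypothetical index name; later duplicate inserts fail with an error.
CREATE UNIQUE NONCLUSTERED INDEX UX_Wycena_NoDuplicates
ON dbo.Wycena (KlienciPortfeleKontaID, WycenaData, WycenaTyp, InID,
               WycenaIlosc, WycenaCena, WycenaWartosc, WycenaWaluta);
```

The calling C# code then needs to handle the resulting unique-violation error instead of relying on the trigger's rollback.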

Seems like you need to look at UNIQUE Constraints
