Best Way To Run Length Encode Data - sql

I've created a table that tracks the various attributes of objects over time.
Id | Attribute1 | Attribute2 | Attribute3 | StartDate  | EndDate
------------------------------------------------------------------
01 | 100        | Null       | Null       | 2004-02-03 | 2006-04-30
01 | 100        | Null       | D          | 2006-05-01 | 2010-11-06
01 | 150        | Null       | D          | 2010-11-07 | Null
02 | 700        | 5600       | Null       | 1998-09-27 | 2002-01-27
New data (tens of thousands of records) comes in each day. What I want to do is compare each record to the current data for that id and then:
a) Do nothing if the attributes match.
b) If the attributes are different, update the current record so that the EndDate is the current date, and create a new record with the new attributes.
c) Create a new record if there isn't any data for that id.
My question is, what is the most efficient way to do this?
I can write a script that goes through each record, does the comparison, and then updates the table as appropriate, but that feels like brute force rather than an intelligent solution.
Would this be a good place to use a cursor?

How do you process data? As it comes in or in batch?
If it is as it comes in, then I would check the attributes in order from the most likely to change to the least likely (just to optimize the checking a bit) and update as needed. Tens of thousands of records is not enough data to worry much about slowdown. This is the straightforward approach.
If you process as a batch (say, at end of business each day), sort the incoming data by ID and then by descending end date. Keep only the latest entry for each ID and discard the rest; no intermediary data matters.
Example: you have 2 incoming entries for id 1, one with endDate Jan 1 and the other with endDate Jan 25. Look at the Jan 25 entry first and update if needed; the Jan 1 entry is too old to care about at that point.
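As a concrete illustration of the batch approach, here is a minimal set-based sketch of the a/b/c logic from the question. It assumes the day's records have already been loaded into a staging table and reduced to one row per Id; the names History and Incoming and the T-SQL functions (GETDATE, ISNULL) are assumptions, not part of either post.

-- Step 1 (case b): close the current row when the incoming attributes differ.
UPDATE h
SET EndDate = GETDATE()
FROM History h
JOIN Incoming i ON i.Id = h.Id
WHERE h.EndDate IS NULL
  AND (   ISNULL(i.Attribute1, -1) <> ISNULL(h.Attribute1, -1)
       OR ISNULL(i.Attribute2, -1) <> ISNULL(h.Attribute2, -1)
       OR ISNULL(i.Attribute3, '') <> ISNULL(h.Attribute3, ''));  -- sentinel comparison is a simplification

-- Step 2 (cases b and c): insert a new current row wherever no open row remains.
INSERT INTO History (Id, Attribute1, Attribute2, Attribute3, StartDate, EndDate)
SELECT i.Id, i.Attribute1, i.Attribute2, i.Attribute3, GETDATE(), NULL
FROM Incoming i
WHERE NOT EXISTS (
    SELECT 1 FROM History h
    WHERE h.Id = i.Id AND h.EndDate IS NULL
);
-- Case a (attributes match) needs no statement: the open row is untouched and nothing is inserted.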

Related

Get the difference in time between multiple rows with the same column name

I need to get the time difference between two dates on different rows. That part is fine on its own, but the same title can appear multiple times. A quick example will explain things some more.
Let's say we have a table with the following records:
| ID | Title | Date                |
|----|-------|---------------------|
| 1  | Down  | 2021-03-07 12:05:00 |
| 2  | Up    | 2021-03-07 13:05:00 |
| 3  | Down  | 2021-03-07 10:30:00 |
| 4  | Up    | 2021-03-07 11:00:00 |
I basically need to get the time difference between the first "Down" and "Up". So ID 1 & 2 = 1 hour.
Then ID 3 & 4 = 30 mins, and so on for however many "Down" and "Up" rows there are.
(These will always be grouped together, one after another.)
It doesn't matter if the results are separate or a SUM of all the differences.
I'm trying to get this done without a temp table.
Thank you.
This can be done with analytic (window) functions, whose availability depends on your SQL engine. The idea is to bring the next row's value onto the current row, so the difference (or sum) can be calculated within a single row.
For the case above it would look something like this:
SELECT
    id,
    title,
    Date AS startdate,
    LEAD(Date, 1) OVER (ORDER BY id) AS enddate
FROM
    table;
Once you have it on the same row, you can carry out your time difference operation.
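Building on that, a rough sketch of the follow-up step: keep only the "Down" rows and compute the gap in minutes. DATEDIFF is SQL Server syntax and t is a placeholder table name, so adjust both for your engine and schema.

WITH paired AS (
    SELECT
        id,
        title,
        Date AS startdate,
        LEAD(Date, 1) OVER (ORDER BY id) AS enddate
    FROM t
)
SELECT
    id,
    DATEDIFF(MINUTE, startdate, enddate) AS downtime_minutes   -- 60 for ids 1 & 2, 30 for ids 3 & 4
FROM paired
WHERE title = 'Down';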

How can I trigger an update to a value in a table when criteria is met on a different table?

I'm aware there is an almost identical question here, but that one covers the SQL query required rather than the mechanism of event triggering.
Let's say I have two tables. One contains performance data for each staff member each week. The other holds each staff member's information. What I want is to update a flag in the performance table to 'Y' or 'N' based on whether that staff member had left as of the week's date.
staffTable
+----------+----------------+------------+
| staff_id | staff_name     | leave_date |
+----------+----------------+------------+
| 1        | Joseph Blogges | 2020-01-24 |
| 2        | Joe Bloggs     | 9999-12-31 |
| 3        | Joey Blogz     | 9999-12-31 |
+----------+----------------+------------+
targetTable
+------------+----------+--------+-----------+
| week_start | staff_id | target | left_flag |
+------------+----------+--------+-----------+
| 2020-01-13 | 1        | 10     | N         |
| 2020-01-20 | 1        | 10     | N         |
| 2020-01-27 | 1        | 8      | Y         |
+------------+----------+--------+-----------+
What I am trying to do is have the left_flag automatically change from 'N' to 'Y' when the week_start value is greater than the staff member's leave_date (in the other table).
I have successfully put this into a view, which works, but then existing applications, views and queries would all need to reference the new view instead of the table, and I want to keep querying the table directly because my front-end has issues interacting live with a view instead of a table.
I have also successfully used a UDF to return the leave_date and created a computed column that checks whether that value is greater than the start_date column. This worked fine until I realised that the UDF was the most resource-consuming query on the entire server, completely out of proportion to the work it does.
Is there a way that I can trigger an update to the staffTable when a criteria is met in another table, or is there a totally better and different way of doing this? If it can't be done easily, I'll try to switch to a view and work around it in the front-end.
I'm going to describe the process rather than writing the code.
What you are describing can be accomplished with a trigger on staffTable. When a row is inserted or updated, the trigger updates the matching rows in targetTable. This would be an AFTER INSERT/UPDATE trigger.
The heart of the trigger would be:
update tt
set left_flag = 'Y'
from targetTable tt
join inserted i
    on tt.staff_id = i.staff_id
where i.leave_date < tt.week_start
  and tt.left_flag <> 'Y';
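For completeness, a minimal sketch of the full trigger shell around that statement (T-SQL; the trigger name is an assumption). As described above, it reacts only to inserts and updates on staffTable:

CREATE TRIGGER trg_staffTable_set_left_flag
ON staffTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- "inserted" holds the new or updated staffTable rows
    UPDATE tt
    SET left_flag = 'Y'
    FROM targetTable tt
    JOIN inserted i
        ON tt.staff_id = i.staff_id
    WHERE i.leave_date < tt.week_start
      AND tt.left_flag <> 'Y';
END;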

Structuring Month-Based Data in SQL

I'm curious about the best way to structure data in a SQL database when I need to keep track of certain fields and how they change month to month.
For example, say I had a users table in which I was trying to store 3 different values: name, email, and how many times each user has logged in each month. Would it be best practice to create a new column for each month and store the login count for that month in it, or would it be better to create a new row (or table) for each month?
My instinct says creating new columns is the best way to reduce redundancy; however, I can see it getting a little unwieldy as the number of columns in the table grows over time. (I was also thinking that if I went the column route, it would warrant a total column that keeps track of all months at once.)
Thanks!
In my opinion, the best approach is to store each login for each user.
Use a query to summarize the data the way you need it when you query it.
You should only be thinking about other structures if summarizing the detail doesn't meet performance requirements -- which for a monthly report don't seem so onerous.
Whatever you do, storing counts in separate columns is not the right thing to do. Every month, you would need to add another column to the table.
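A minimal sketch of that approach, with one row stored per login event and the monthly counts derived at query time. The table and column names are illustrative rather than from the question, and YEAR/MONTH are the SQL Server/MySQL spellings:

CREATE TABLE user_logins (
    user_id  INT      NOT NULL,
    login_at DATETIME NOT NULL
);

-- Monthly summary, produced on demand rather than stored:
SELECT
    user_id,
    YEAR(login_at)  AS login_year,
    MONTH(login_at) AS login_month,
    COUNT(*)        AS login_count
FROM user_logins
GROUP BY user_id, YEAR(login_at), MONTH(login_at);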
I'm not an expert, but in my opinion it is best (in your case) to store the login data in a separate table. That way you can manipulate the data easily and you don't have to modify the table design in the future.
PK: (UserID, Date), or a new surrogate column (e.g. RowNo with auto-increment)
+--------+------------+-----------+
| UserID | Date       | NoOfTimes |
+--------+------------+-----------+
| 01     | 2018.01.01 | 1         |
| 01     | 2018.01.02 | 3         |
| 01     | 2018.01.03 | 5         |
| ..     |            |           |
| 02     | 2018.01.01 | 2         |
| 02     | 2018.01.02 | 6         |
+--------+------------+-----------+
Or
PK: (UserID, Year, Month), or a new surrogate column (e.g. RowNo with auto-increment)
+--------+------+-------+-----------+
| UserID | Year | Month | NoOfTimes |
+--------+------+-------+-----------+
| 01     | 2018 | Jan   | 10        |
| 01     | 2018 | Feb   | 13        |
+--------+------+-------+-----------+
Before you create the table, please take a look at database normalization, especially the 1st (1NF), 2nd (2NF) and 3rd (3NF) normal forms.
https://www.tutorialspoint.com/dbms/database_normalization.htm
https://www.lifewire.com/database-normalization-basics-1019735
https://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/
https://www.studytonight.com/dbms/database-normalization.php
https://medium.com/omarelgabrys-blog/database-normalization-part-7-ef7225150c7f
Either approach is valid, depending on query patterns and join requirements.
One row for each month
For a user, the row containing the login count for a month is inserted once data is available for that month, so there is 1 row per month per user. This design makes it easier to join on the month column. However, multiple rows need to be read to get a full year's data for a user.
-- column list
name
email
month
login_count
-- example entries
'user1', 'user1@email.com', 'jan', 100
'user2', 'user2@email.com', 'jan', 65
'user1', 'user1@email.com', 'feb', 90
'user2', 'user2@email.com', 'feb', 75
One row for all months
You do not need to add columns dynamically, since the number of months is known in advance: the table can be created up front to accommodate all twelve. By default, every month_login_count column is initialized to 0, and the row is updated as each month's login count becomes available. There is 1 row per user. This design is not the best for joining by month, but only one row needs to be read to get a full year's data for a user.
-- column list
name
email
jan_login_count
feb_login_count
mar_login_count
apr_login_count
may_login_count
jun_login_count
jul_login_count
aug_login_count
sep_login_count
oct_login_count
nov_login_count
dec_login_count
-- example entries
'user1', 'user1@email.com', 100, 90, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
'user2', 'user2@email.com', 65, 75, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
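To make the trade-off concrete, here is a rough sketch of a "total logins for the year" query under each design. The table names monthly_logins and yearly_logins are assumptions; only the column names come from the lists above.

-- One row per month: aggregate across rows.
SELECT name, email, SUM(login_count) AS yearly_logins
FROM monthly_logins
GROUP BY name, email;

-- One row for all months: add up the columns within the single row.
SELECT name, email,
       jan_login_count + feb_login_count + mar_login_count + apr_login_count +
       may_login_count + jun_login_count + jul_login_count + aug_login_count +
       sep_login_count + oct_login_count + nov_login_count + dec_login_count AS yearly_logins
FROM yearly_logins;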

Is using only 3 timestamps for a bitemporal SQL database possible?

When implementing a bitemporal database in SQL, it is usually recommended to use the following timestamps:
ValidStart
ValidEnd
TransactionStart
TransactionEnd
I have used this approach a few times before, but I have always wondered why using only 3 timestamps, leaving TransactionEnd out, isn't just as correct an implementation. In that scheme a transaction-time range spans from one TransactionStart to the next TransactionStart.
Are there any strong arguments against using only 3 timestamps, which would limit the size of the database?
As mentioned in a comment, it's for simplicity: certain queries become somewhat harder to write without it.
Consider the following example. John is born in some location, Location1, on January 1st 1990, but the birth is only registered on the 5th.
The database table, Persons, now looks like this:
+------+-----------+------------+------------+------------+------------+
| Name | Location  | valid_from | valid_to   | trans_from | trans_to   |
+------+-----------+------------+------------+------------+------------+
| John | Location1 | 01-01-1990 | 99-99-9999 | 05-01-1990 | 99-99-9999 |
+------+-----------+------------+------------+------------+------------+
At this point, removing the trans_to column wouldn't cause too much trouble, but suppose the following:
After some years, say 20, John relocates to Location2 and informs the officials 20 days later.
This makes the Persons table look like this:
+------+-----------+------------+------------+------------+------------+
| Name | Location  | valid_from | valid_to   | trans_from | trans_to   |
+------+-----------+------------+------------+------------+------------+
| John | Location1 | 01-01-1990 | 99-99-9999 | 05-01-1990 | 20-01-2010 |
| John | Location1 | 01-01-1990 | 01-01-2010 | 20-01-2010 | 99-99-9999 |
| John | Location2 | 01-01-2010 | 99-99-9999 | 20-01-2010 | 99-99-9999 |
+------+-----------+------------+------------+------------+------------+
Suppose someone wanted to find out "where does the system currently think John is living" (transaction time), regardless of where he actually lives. This can (roughly) be queried in SQL as follows:
SELECT Location
FROM Persons
WHERE Name = 'John' AND trans_from <= NOW AND trans_to > NOW
Suppose the transaction end time was removed
+------+-----------+------------+------------+------------+
| Name | Location  | valid_from | valid_to   | trans_from |
+------+-----------+------------+------------+------------+
| John | Location1 | 01-01-1990 | 99-99-9999 | 05-01-1990 |
| John | Location1 | 01-01-1990 | 01-01-2010 | 20-01-2010 |
| John | Location2 | 01-01-2010 | 99-99-9999 | 20-01-2010 |
+------+-----------+------------+------------+------------+
The query above is of course no longer valid, and writing the equivalent logic against the last table would be somewhat difficult. Since trans_to is missing, it has to be derived from the other rows in the table. For instance, the implicit trans_to of the first row (the oldest entry) is the trans_from of the second row, which is the newer of the two.
The transaction end time is thus either 9999-99-99, if the row is the newest, or the trans_from of the row immediately succeeding it.
This means that the data concerning a specific row is not kept entirely in that row, and the rows depend on each other, which is (of course) unwanted. Furthermore, it can be quite difficult to determine which row is the immediate successor of another, which makes the queries even more complex.
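As a rough sketch of what that derivation looks like, assuming the successor is simply the next later trans_from for the same Name (the very assumption that breaks down when several rows share a trans_from):

SELECT
    p.Name,
    p.Location,
    p.valid_from,
    p.valid_to,
    p.trans_from,
    COALESCE(
        (SELECT MIN(p2.trans_from)      -- the next, later registration for this person
         FROM Persons p2
         WHERE p2.Name = p.Name
           AND p2.trans_from > p.trans_from),
        '9999-12-31'                    -- no successor: the row is still current
    ) AS trans_to
FROM Persons p;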
Here is an example of using only one timestamp instead of two in a 1D temporal database:
I have a shop and I want to record when a user X was in my shop.
If I use a model with start-time and end-time, this info can be recorded as
X,1,2
X,3,4
so user X was in my shop between 1 and 2 and between 3 and 4. This is clear, simple and concise.
If I model my data with only start-time as a timestamp, I will have:
X,1
X,2
X,3
X,4
but how can I interpret this data?
X from (1,2) and X from (3,4)? Or X from (2,3) and X from (1,4)?
Or X from (1,2), (2,3), (3,4)? Is X from (4,inf) valid?
To understand this data I need to add additional constraints/logic/information to my data or code:
maybe the intervals are non-overlapping, maybe I add an id per object, etc.
None of these solutions works in all cases; they can be difficult to maintain and bring other issues.
For example, if I add an id (a and b in this case) to every interval, the result is:
X,a,1
X,a,2
X,b,3
X,b,4
Instead of storing my data in 2 rows and 3 columns, it is now stored in 4 rows and 3 columns.
Not only do I gain nothing from this model, it can be reduced back to:
X,a, 1,2
X,b, 3,4
further reduced to
X, 1,2
X, 3,4
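To illustrate the extra logic the single-timestamp model forces on you, here is a rough sketch that pairs consecutive visit rows back into intervals. The table and column names are assumptions, and it only works under exactly the kind of added constraints mentioned above: visits never overlap and every entry has a matching exit.

WITH numbered AS (
    SELECT
        user_id,
        visit_time,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY visit_time) AS rn
    FROM shop_visits
)
SELECT
    entries.user_id,
    entries.visit_time AS start_time,
    exits.visit_time   AS end_time
FROM numbered entries
JOIN numbered exits
  ON exits.user_id = entries.user_id
 AND exits.rn = entries.rn + 1
WHERE entries.rn % 2 = 1;   -- odd rows are entries, even rows are exits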

SQL: Creating a common table from multiple similar tables

I have multiple databases on a server, each with a large table where most rows are identical across all databases. I'd like to move this table to a shared database and then have an override table in each application database which has the differences between the shared table and the original table.
The aim is to make updating and distributing the data easier as well as keeping database sizes down.
Problem constraints
The table is a hierarchical data store with date based validity.
table DATA (
    ID int primary key,
    CODE nvarchar,
    PARENT_ID int foreign key references DATA(ID),
    END_DATE datetime,
    ...
)
Each unique CODE in DATA may have a number of rows, but at most a single row where END_DATE is null or greater than the current time (a single valid row per CODE). New references are only made to valid rows.
Updating the shared database should not require anything to be run in application databases. This means any override tables are final once they have been generated.
Existing references to DATA.ID must point to the same CODE, but other columns do not need to be the same. This means any current rows can be invalidated if necessary and multiple occurrences of the same CODE may be combined.
PARENT_ID references must point to the same parent CODE before and after the split. The actual PARENT_ID value may change if necessary.
The shared table is updated regularly from an external source and these updates need to be reflected in each database's DATA. CODEs that do not appear in the external source can be thought of as invalid, new references to these will not be added.
Existing functionality will continue to use DATA, so the new view (or alternative) must be transparent. It may, however, contain more rows than the original provided earlier constraints are met.
New functionality will use the shared table directly.
Select performance is a concern, insert/update/delete is not.
The solution needs to support SQL Server 2008 R2.
Possible solution
-- in a single shared DB
DATA_SHARED (table)
-- in each app DB
DATA_SHARED (synonym to DATA_SHARED in shared DB)
DATA_OVERRIDE (table)
DATA (view of DATA_SHARED and DATA_OVERRIDE)
Take an existing DATA table to become DATA_SHARED.
Exclude IDs with more than one possible CODE so only rows common across all databases remain. These missing rows will be added back once the data is updated the first time.
Unfortunately every DATA_OVERRIDE will need all rows that differ in any table, not only rows that differ between DATA_SHARED and the previous DATA. There are several IDs that differ only in a single database, which causes all the other databases to inflate. Ideas?
This solution causes DATA_SHARED to have a discontinuous ID space. It's a mild annoyance rather than a major issue, but worth noting.
edit: I should be able to keep all of the rows in DATA_SHARED, just invalidate them, then I only need to store differing rows in DATA_OVERRIDE.
I can't think of any situations where PARENT_ID references become invalid, thoughts?
Before:
DB1.DATA
ID | CODE | PARENT_ID | END_DATE
1  | A    | NULL      | NULL
2  | A1   | 1         | 2020
3  | A2   | 1         | 2010
DB2.DATA
ID | CODE | PARENT_ID | END_DATE
1  | A    | NULL      | NULL
2  | X    | NULL      | NULL
3  | A2   | 1         | 2010
4  | X1   | 2         | NULL
5  | A1   | 1         | 2020
After initial processing (DATA_SHARED created from DB1.DATA):
SHARED.DATA_SHARED
ID | CODE | PARENT_ID | END_DATE
1  | A    | NULL      | NULL
3  | A2   | 1         | 2010
-- END_DATE is omitted from DATA_OVERRIDE as every row is implicitly invalid
DB1.DATA_OVERRIDE
ID | CODE | PARENT_ID
2  | A1   | 1
DB2.DATA_OVERRIDE
ID | CODE | PARENT_ID
2  | X    |
4  | X1   | 2
5  | A1   | 1
After update from external data where A1 exists in source but X and X1 don't:
SHARED.DATA_SHARED
ID | CODE | PARENT_ID | END_DATE
1  | A    | NULL      | NULL
3  | A2   | 1         | 2010
6  | A1   | 1         | 2020
edit: The DATA view would be something like:
select D.ID, ...
from DATA_SHARED D
left join DATA_OVERRIDE O on D.ID = O.ID
where O.ID is null
union all
select ID, ...
from DATA_OVERRIDE
order by ID
Given the small number of rows in DATA_OVERRIDE, performance is good enough.
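Spelled out as a full view definition it might look something like the sketch below. The explicit column list and the past-dated END_DATE on override rows are assumptions chosen to satisfy the constraint that override rows are implicitly invalid, and ORDER BY is omitted because SQL Server does not allow it in a plain view body.

CREATE VIEW DATA
AS
SELECT S.ID, S.CODE, S.PARENT_ID, S.END_DATE
FROM DATA_SHARED S
LEFT JOIN DATA_OVERRIDE O ON O.ID = S.ID
WHERE O.ID IS NULL
UNION ALL
SELECT O.ID, O.CODE, O.PARENT_ID,
       CAST('1900-01-01' AS datetime) AS END_DATE   -- assumed sentinel: override rows are never valid
FROM DATA_OVERRIDE O;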
Alternatives
I also considered an approach where instead of DATA_SHARED sharing IDs with the original DATA, there would be mapping tables to link DATA.IDs to DATA_SHARED.IDs. This would mean DATA_SHARED would have a much cleaner ID-space and there could be less data duplication, but the DATA view would require some fairly heavy joins. The additional complexity is also a significant negative.
Conclusion
Thank you for your time if you made it all the way to the end; this question ended up quite long because I was thinking it through as I wrote it. Any suggestions or comments would be appreciated.