Remove newest redundant row and update timestamp - sql

I'm working with a SQLite database that receives large data dumps on a regular basis from several sources. Unfortunately, those sources aren't intelligent about what they dump, and I end up with a lot of repeated records from one time to the next. I'm looking for a way to remove these repeated records without affecting the records that have legitimately changed from the past dump to this one.
Here's the general structure of the data (_id is the primary key):
| _id | _dateUpdated | _dateEffective | _dateExpired | name | status | location |
|-----|--------------|----------------|--------------|------|--------|----------|
| 1 | 2016-05-01 | 2016-05-01 | NULL | Fred | Online | USA |
| 2 | 2016-05-01 | 2016-05-01 | NULL | Jim | Online | USA |
| 3 | 2016-05-08 | 2016-05-08 | NULL | Fred | Offline| USA |
| 4 | 2016-05-08 | 2016-05-08 | NULL | Jim | Online | USA |
| 5 | 2016-05-15 | 2016-05-15 | NULL | Fred | Offline| USA |
| 6 | 2016-05-15 | 2016-05-15 | NULL | Jim | Online | USA |
I'd like to be able to reduce this data to something like this:
| _id | _dateUpdated | _dateEffective | _dateExpired | name | status | location |
|-----|--------------|----------------|--------------|------|--------|----------|
| 1 | 2016-05-01 | 2016-05-01 | 2016-05-07 | Fred | Online | USA |
| 2 | 2016-05-15 | 2016-05-01 | NULL | Jim | Online | USA |
| 3 | 2016-05-15 | 2016-05-08 | NULL | Fred | Offline| USA |
The idea here is that rows 4, 5, and 6 exactly duplicate rows 2 and 3 except for the timestamps (I'd need to compare by all three fields - name, status, location). However, row 3 does not duplicate row 1 (status changed from Online to Offline), so the _dateExpired field is set in row 1, and row 3 becomes the most recent record.
I'm querying this table with something like this:
SELECT * FROM Data WHERE
date(_dateEffective) <= date("now")
AND (_dateExpired IS NULL OR date(_dateExpired) > date("now"))
Is this sort of reduction possible in SQLite?
I am still a beginner to SQL and database design in general, so it's possible that I haven't structured the database in the best way. I'm open to suggestions there as well...I'm going for the ability to query data at a given point in time - for example, "what was Jim's status around 2016-05-06?"
Thanks in advance!

Consider using a staging table where the dump file goes into a DumpTable (regularly cleaned out before each dump) and then an INSERT...SELECT query migrates to your final table.
Now the SELECT portion maintains a correlated subquery (to calculate new [_dateExpired] for needed rows) and derived table subquery (to filter out non-dups according to your criteria). Finally, the LEFT JOIN...NULL with FinalTable is to ensure no duplicate records are appended, assuming [_id] is a unique identifier. Below is the routine:
Clean Out DumpTable
DELETE FROM DumpTable;
Run Dump Routine to be appended into DumpTable
Append Records to FinalTable
INSERT INTO FinalTable ([_id], [_dateUpdated], [_dateEffective], [_dateExpired],
[name], status, location)
SELECT d.[_id], d.[_dateUpdated], d.[_dateEffective],
(SELECT Min(date(sub.[_dateEffective], '-1 day'))
FROM DumpTable sub
WHERE sub.[name] = DumpTable.[name]
AND sub.[_dateEffective] > DumpTable.[_dateEffective]
AND sub.status <> DumpTable.status) As calcExpired
d.name, d.status, d.location
FROM DumpTable d
INNER JOIN
(SELECT Min(DumpTable.[_id]) AS min_id,
DumpTable.name, DumpTable.status
FROM DumpTable
GROUP BY DumpTable.name, DumpTable.status) AS c
ON (c.name = d.name)
AND (c.min_id = d.[_id])
AND (c.status = d.status)
LEFT JOIN FinalTable f
ON d.[_id] = f.[_id]
WHERE f.[_id] IS NULL;
-- INSERTED RECORDS:
-- _id _dateUpdated _dateEffective _dateExpired name status location
-- 1 2016-05-01 2016-05-01 2016-05-07 Fred Online USA
-- 2 2016-05-01 2016-05-01 Jim Online USA
-- 3 2016-05-08 2016-05-08 Fred Offline USA

Is this sort of reduction possible in SQLite?
The answer to any "reduction" question in SQL is always Yes. The trick is to find what axes you're reducing along.
Here's a partial solution to illustrate; it gives the first Online date for each name & location.
select min(_dateEffective) as start_date
, name
, location
from Data
where status = 'Online'
group by
name
, location
With an outer join back to the table (on name & location) where the status is 'Offline' and the _dateEffective is greater than start_date, you get your _dateExpired.
_id is the primary key
There is a commonly held misunderstanding that every table needs some kind of sequential "ID" number as a primary key. The key you really care about is known as a natural key, 1 or more columns in the data that uniquely identify the data. In your case, it looks to me like that's _dateEffective, name, status, and location. At the very least, declare them unique to prevent accidental duplication.

Related

Remove duplicate rows in MS-Access

I am using Microsoft Access and in it, I have a table with data that is sometimes repeated. I'm not able to create an SQL query that removes duplicate data, leaving only distinct data in the table. Can someone help me?
My current table:
Date | Level | Name
---------+--------+--------
12/25/2021 | 2 | Jack
12/25/2021 | 2 | Jack
12/10/2021 | 3 | Ana
12/01/2021 | 1 | Lenon
12/01/2021 | 1 | Lenon
12/30/2021 | 3 | Ana
Expected result:
Date | Level | Name
---------+--------+--------
12/25/2021 | 2 | Jack
12/10/2021 | 3 | Ana
12/01/2021 | 1 | Lenon
12/30/2021 | 3 | Ana
PS: Ana appears twice in the expected result table because the dates of the two rows referring to Ana are different, so they are not duplicated values.
Just use select distinct:
select distinct t.*
from t;
I would add that tables should not have duplicate rows. Something is wrong with the table generation if you are getting duplicates -- either the query being used or the process for inserting rows into the table.
You can do a group by of the Date, Level and Name columns.
Use this query:
SELECT Date
,Level
,Name
FROM <TableName>
GROUP BY Date, Level, Name

SQL - specific requirement to compare tables

I'm trying to merge 2 queries into 1 (cuts the number of daily queries in half): I have 2 tables, I want to do a query against 1 table, then the same query against the other table that has the same list just less entries.
Basically its a list of (let's call it for obfuscation) people and hobby. One table is ALL people & hobby, the other shorter list is people & hobby that I've met. Table 2 would all be found in table 1. Table 1 includes entries (people I have yet to meet) not found in table 2
The tables are synced up from elsewhere, what I'm looking to do is print a list of ALL people in the first column then print the hobby ONLY of people that are on both lists. That way I can see the lists merged, and track the rate at which the gap between both lists is closing. I have tried a number of SQL combinations but they either filter out the first table and match only items that are true for both (i.e. just giving me table 2) or just adding table 2 to table 1.
Example of what I'm trying to do below:
+---------+----------+--+----------+---------+--+---------+----------+
| table1 | | | table2 | | | query | |
+---------+----------+--+----------+---------+--+---------+----------+
| name | hobby | | activity | person | | name | hobby |
| bob | fishing | | fishing | bob | | bob | fishing |
| bill | vidgames | | hiking | sarah | | bill | |
| sarah | hiking | | planking | sabrina | | sarah | hiking |
| mike | cooking | | | | | mike | |
| sabrina | planking | | | | | sabrina | planking |
+---------+----------+--+----------+---------+--+---------+----------+
Normally I'd just take the few days to learn SQL a bit better however I'm stretched pretty thin at work as it is!
I should mention the table 2 is flipped and the headings are all unique (don't think this matters)!
I think you just want a left join:
select t1.name, t2.activity as hobby
from table1 t1 left join
table2 t2
on t1.name = t2.person;

Returning singular row/value from joined table date based on closest date

I have a Production Table and a Standing Data table. The relationship of Production to Standing Data is actually Many-To-Many which is different to how this relationship is usually represented (Many-to-One).
The standing data table holds a list of tasks and the score each task is worth. Tasks can appear multiple times with different "ValidFrom" dates for changing the score at different points in time. What I am trying to do is query the Production Table so that the TaskID is looked up in the table and uses the date it was logged to check what score it should return.
Here's an example of how I want the data to look:
Production Table:
+----------+------------+-------+-----------+--------+-------+
| RecordID | Date | EmpID | Reference | TaskID | Score |
+----------+------------+-------+-----------+--------+-------+
| 1 | 27/02/2020 | 1 | 123 | 1 | 1.5 |
| 2 | 27/02/2020 | 1 | 123 | 1 | 1.5 |
| 3 | 30/02/2020 | 1 | 123 | 1 | 2 |
| 4 | 31/02/2020 | 1 | 123 | 1 | 2 |
+----------+------------+-------+-----------+--------+-------+
Standing Data
+----------+--------+----------------+-------+
| RecordID | TaskID | DateActiveFrom | Score |
+----------+--------+----------------+-------+
| 1 | 1 | 01/02/2020 | 1.5 |
| 2 | 1 | 28/02/2020 | 2 |
+----------+--------+----------------+-------+
I have tried the below code but unfortunately due to multiple records meeting the criteria, the production data duplicates with two different scores per record:
SELECT p.[RecordID],
p.[Date],
p.[EmpID],
p.[Reference],
p.[TaskID],
s.[Score]
FROM ProductionTable as p
LEFT JOIN StandingDataTable as s
ON s.[TaskID] = p.[TaskID]
AND s.[DateActiveFrom] <= p.[Date];
What is the correct way to return the correct and singular/scalar Score value for this record based on the date?
You can use apply :
SELECT p.[RecordID], p.[Date], p.[EmpID], p.[Reference], p.[TaskID], s.[Score]
FROM ProductionTable as p OUTER APPLY
( SELECT TOP (1) s.[Score]
FROM StandingDataTable AS s
WHERE s.[TaskID] = p.[TaskID] AND
s.[DateActiveFrom] <= p.[Date]
ORDER BY S.DateActiveFrom DESC
) s;
You might want score basis on Record Level if so, change the where clause in apply.

How to insert from each row into to multiple tables

I'm pretty new to SQL Server (using ssms). I need some help to insert and organize data from one table into multiple tables (which are connected to each other by PK/FK).
The source table has the following columns:
Email, UserName, Phone
It's a messy table with lots of duplicates: same email with different username and so on..
My data tables are:
Person - PersonID(PK, int, not null)
Email - Email (nvarchar, null) , PersonID (FK, int, not null)
Phone - PhoneNumber (int, null) , PersonID (FK, int, not null)
UserName - UserName (nvarchar, null) , PersonID (FK, int, not null)
For each row in the source table, I need to check if the person already exists (by Email); if it does exist, I need to add the new data (if any), else I need to create a new person and add the data.
I searched here for some solutions and find recommendations of using CURSOR.
I tried it, but it takes a really long time to execute (hours.. and still going)
Thanks for any help!
example:
from>
EMAIL | USERNAME | PHONE
------------------------
a#a.a | john | 956165
b#b.b | smith | 123456
c#c.c | bob | 654321
d#d.d | mike | 986514
a#a.a | dan | 658732
e#e.e | dave | 147258
f#f.f | harry | 951962
b#b.b | emmy | 456789
g#g.g | kelly | 789466
h#h.h | kelly | 258369
a#a.a | ana | 852369
to>
EMAIL | PERSONID
----------------
a#a.a | 1
b#b.b | 2
c#c.c | 3
d#d.d | 4
e#e.e | 5
f#f.f | 6
g#g.g | 7
h#h.h | 8
USERNAME | PERSONID
-------------------
john | 1
smith | 2
bob | 3
mike | 4
dan | 1
dave | 5
harry | 6
emmy | 2
kelly | 7
kelly | 8
ana | 1
PHONE | PERSONID
----------------
956165 | 1
123456 | 2
654321 | 3
986514 | 4
658732 | 1
147258 | 5
951962 | 6
456789 | 2
789466 | 7
258369 | 8
852369 | 1
Cursors will generally be slower as they operate on a row-by-row basis. Using set based operations, such as a join, will yield better performance. It's somewhat older, but this article further details the implications of cursors as opposed to set operations. I wasn't entirely sure what columns you want to use to verify matches, as well as what data to add, but a basic example is below and you can fill in the columns as necessary. The Email table was used in the example. For the UPDATE, this will update existing rows based off corresponding rows in the source table. Being an INNER JOIN, only rows with matches on both sides will be impacted. In the second statement, this is an INSERT using only rows from the source table that don't exist in the Email table. This same functionality could also be accomplished using the MERGE statement, however there are a number of issues with this, including problems with deadlocks and key violations.
Update Existing Rows:
UPDATE E
SET E.ColumnA = SRC.ColumnA,
E.ColumnB = SRC.ColumnB
FROM YourDatabase.YourSchema.Email E
INNER JOIN YourDatabase.YourSchema.SourceTable SRC
ON E.Email = SRC.Email
Add New Rows:
INSERT INTO YourDatabase.YourSchema.Email (ColumnA, ColumnB)
SELECT
ColumnA,
ColumnB
FROM YourDatabase.YourSchema.SourceTable
WHERE EMAIL NOT IN ((SELECT EMAIL FROM YourDatabase.YourSchema.Email))

Randomly Populating Foreign Key In Sample Data Set

I'm generating test data for a new database, and I'm having trouble populating one of the foreign key fields. I need to create a relatively large number (1000) of entries in a table (SurveyResponses) that has a foreign key to a table with only 6 entries (Surveys)
The database already has a Schools table that has a few thousand records. For arguments sake lets say it looks like this
Schools
+----+-------------+
| Id | School Name |
+----+-------------+
| 1 | PS 1 |
| 2 | PS 2 |
| 3 | PS 3 |
| 4 | PS 4 |
| 5 | PS 5 |
+----+-------------+
I'm creating a new Survey table. It will only have about 3 rows.
Survey
+----+-------------+
| Id | Col2 |
+----+-------------+
| 1 | 2014 Survey |
| 2 | 2015 Survey |
| 3 | 2016 Survey |
+----+-------------+
SurveyResponses simply ties a school to a survey.
Survey Responses
+----+----------+----------+
| Id | SchoolId | SurveyId |
+----+----------+----------+
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 1 |
| 4 | 4 | 3 |
| 5 | 5 | 2 |
+----+----------+----------+
Populating the SurveyId field is what's giving me the most trouble. I can randomly select 1000 Schools, but I haven't figured out a way to generate 1000 random SurveyIds. I've been trying to avoid a while loop, but maybe that's the only option?
I've been using Red Gate SQL Data Generator to generate some of my test data, but in this case I'd really like to understand how this can be done with raw SQL.
Here is one way, using a correlated subquery to get a random survey associated with each school:
select s.schoolid,
(select top 1 surveyid
from surveys
order by newid()
) as surveyid
from schools s;
Note: This doesn't seem to work. Here is a SQL Fiddle showing the non-workingness. I am quite surprised it doesn't work, because newid() should be a
EDIT:
If you know the survey ids have no gaps and start with 1, you can do:
select 1 + abs(checksum(newid()) % 3) as surveyid
I did check that this does work.
EDIT II:
This appears to be overly aggressive optimization (in my opinion). Correlating the query appears to fix the problem. So, something like this should work:
select s.schoolid,
(select top 1 surveyid
from surveys s2
where s2.surveyid = s.schoolid or s2.surveyid <> s.schoolid -- nonsensical condition to prevent over optimization
order by newid()
) as surveyid
from schools s;
Here is a SQL Fiddle demonstrating this.