Join SQL server tables without a unique identifier and duplicate data

Join SQL server tables without a unique identifier and duplicate data - sql

I'm relatively inexperienced in SQL and could use some help beyond the usual SELECT and JOIN.
The Problem
Suppose you have 2 tables you wish to join in Microsoft SQL, however they are missing a unique identifier so duplicates entries are incorrectly generated. I've created an example SQLfiddle to try and demonstrate using a small subset of the full database schema http://sqlfiddle.com/#!18/df3fc.
One table has a list of measurement steps taken for 2 systems, identified by their serial. These measurement steps can contain multiple pieces of data, which are contained in the second table. This would not normally be an issue but, as in the sqlfiddle example for serial=1004, sometimes the same data may be retaken as part of a rework. When I then query, each piece of rework data gets joined to each step, duplicating data. The select query:
SELECT my_measurement_steps.id AS steps_id, my_measurement_steps.serial, my_measurement_data.id AS data_id, my_measurement_data.my_data, my_measurement_data.measurementid, my_measurement_steps.date
FROM my_measurement_steps INNER JOIN
my_measurement_data ON my_measurement_steps.serial = my_measurement_data.serial AND
my_measurement_steps.measurementid = my_measurement_data.measurementid
Desired Output
steps_id
serial
data_id
my_data
measurementid
date
15
1004
36
0.9496555
33
2021-10-12 07:55:58.100
14
1004
35
-0.03252285
11
2021-10-07 07:56:31.530
14
1004
34
-0.0003081787
11
2021-10-07 07:56:31.530
13
1004
33
-0.01728721
10
2021-10-07 07:56:31.530
13
1004
32
-0.1996608
10
2021-10-07 07:56:31.530
12
1004
31
0.003044653
9
2021-10-07 07:24:49.500
12
1004
30
0.002392432
9
2021-10-07 07:24:49.500
11
1004
29
1.012242
8
2021-10-07 07:24:30.720
11
1004
28
1.003897
8
2021-10-07 07:24:30.720
11
1004
27
0.9917302
8
2021-10-07 07:24:30.720
11
1004
26
-0.002975781
8
2021-10-07 07:24:30.720
11
1004
25
-0.002746948
8
2021-10-07 07:24:30.720
10
1004
24
0.9695401
33
2021-10-05 11:37:51.430
9
1005
23
0.9731983
33
2021-10-05 08:00:10.490
8
1005
22
0.01013499
11
2021-10-01 07:12:07.470
8
1005
21
-0.007311231
11
2021-10-01 07:12:07.470
7
1005
20
-0.0003634033
10
2021-10-01 07:12:07.470
7
1005
19
-0.2021408
10
2021-10-01 07:12:07.470
6
1005
18
-0.002507007
9
2021-09-30 13:00:57.260
6
1005
17
0.001181299
9
2021-09-30 13:00:57.260
5
1005
16
1.007857
8
2021-09-30 12:39:50.280
5
1005
15
1.000333
8
2021-09-30 12:39:50.280
5
1005
14
0.9913442
8
2021-09-30 12:39:50.280
5
1005
13
0.002449243
8
2021-09-30 12:39:50.280
5
1005
12
-0.002550488
8
2021-09-30 12:39:50.280
4
1004
11
-0.02970417
11
2021-09-30 06:57:33.160
4
1004
10
-0.0007542603
11
2021-09-30 06:57:33.160
3
1004
9
-0.005267761
10
2021-09-30 06:57:33.160
3
1004
8
-0.2038888
10
2021-09-30 06:57:33.160
2
1004
7
-0.007525305
9
2021-09-30 06:56:59.060
2
1004
6
-0.004998779
9
2021-09-30 06:56:59.060
1
1004
5
0.9935537
8
2021-09-29 12:34:08.090
1
1004
4
0.9952038
8
2021-09-29 12:34:08.090
1
1004
3
0.9978707
8
2021-09-29 12:34:08.090
1
1004
2
-0.0006630127
8
2021-09-29 12:34:08.090
1
1004
1
0.0002386719
8
2021-09-29 12:34:08.090
I'm unsure how to achieve the desired output given the repeating data. Also for some serials there can be more than 1 repeat as shown in the example.
Happy to provide any extra information required.
Many Thanks.
Code to Generate Tables
create table my_measurement_steps(id int, serial int, measurementid int, date datetime);
create table my_measurement_data(id int, serial int, my_data float(7), measurementid int);
insert into my_measurement_steps values
(1,1004,8,'2021-09-29 12:34:08.090'),
(2,1004,9,'2021-09-30 06:56:59.060'),
(3,1004,10,'2021-09-30 06:57:33.160'),
(4,1004,11,'2021-09-30 06:57:33.160'),
(5,1005,8,'2021-09-30 12:39:50.280'),
(6,1005,9,'2021-09-30 13:00:57.260'),
(7,1005,10,'2021-10-01 07:12:07.470'),
(8,1005,11,'2021-10-01 07:12:07.470'),
(9,1004,33,'2021-10-05 08:00:10.490'),
(10,1005,33,'2021-10-05 11:37:51.430'),
(11,1004,8,'2021-10-07 07:24:30.720'),
(12,1004,9,'2021-10-07 07:24:49.500'),
(13,1004,10,'2021-10-07 07:56:31.530'),
(14,1004,11,'2021-10-07 07:56:31.530'),
(15,1004,33,'2021-10-12 07:55:58.100');
insert into my_measurement_data values
(1,1004,0.0002386719,8),
(2,1004,-0.0006630127,8),
(3,1004,0.9978707,8),
(4,1004,0.9952038,8),
(5,1004,0.9935537,8),
(6,1004,-0.004998779,9),
(7,1004,-0.007525305,9),
(8,1004,-0.2038888,10),
(9,1004,-0.005267761,10),
(10,1004,-0.0007542603,11),
(11,1004,-0.02970417,11),
(12,1005,-0.002550488,8),
(13,1005,0.002449243,8),
(14,1005,0.9913442,8),
(15,1005,1.000333,8),
(16,1005,1.007857,8),
(17,1005,0.001181299,9),
(18,1005,-0.002507007,9),
(19,1005,-0.2021408,10),
(20,1005,-0.0003634033,10),
(21,1005,-0.007311231,11),
(22,1005,0.01013499,11),
(23,1004,0.9695401,33),
(24,1005,0.9731983,33),
(25,1004,-0.002746948,8),
(26,1004,-0.002975781,8),
(27,1004,0.9917302,8),
(28,1004,1.003897,8),
(29,1004,1.012242,8),
(30,1004,0.002392432,9),
(31,1004,0.003044653,9),
(32,1004,-0.1996608,10),
(33,1004,-0.01728721,10),
(34,1004,-0.0003081787,11),
(35,1004,-0.03252285,11),
(36,1004,0.9496555,33);
Edits
Added datestamp to measurement step table - sqlfiddle not working so can't update.
All tables now updated and sqlfiddle
Removed section and added desired output

You want to detect blocks of rows belonging together.
When sorting my_measurement_steps we see that serial/measurementid 1004/8 occurs twice for instance, once in row #1 and then again in row #11.
When sorting my_measurement_data we see about the same thing. The serial/measurementid 1004/8 occurs in two blocks, once in rows #1-5 and then again in rows #25-29.
You want to join the serial/measurementid's nth occurence in my_measurement_steps with its nth occurrence in my_measurement_data.
The detection of such blocks is called a gaps and islands problem. This can be done with two concurrent row counts.
with data_groups_found as
(
select
my_measurement_data.*,
row_number() over (order by id) -
row_number() over (partition by serial, measurementid order by id) as grp
from my_measurement_data
)
, data_groups_numbered as
(
select
data_groups_found.*,
dense_rank() over (partition by serial, measurementid order by grp) as grp_id
from data_groups_found
)
, steps_numbered as
(
select
my_measurement_steps.*,
row_number() over (partition by serial, measurementid order by id) as grp_id
from my_measurement_steps
)
select *
from steps_numbered s
left join data_groups_numbered d
on d.serial = s.serial
and d.measurementid = s.measurementid
and d.grp_id = s.grp_id
order by s.id, d.id;
Demo: http://sqlfiddle.com/#!18/df3fc/6

Related

How to get top values when there is a tie

I am having difficulty figuring out this dang problem. From the data and queries I have given below I am trying to see the email address that has rented the most movies during the month of September.
There are only 4 relevant tables in my database and they have been anonymized and shortened:
Table "cust":
cust_id
f_name
l_name
email
1
Jack
Daniels
jack.daniels#google.com
2
Jose
Quervo
jose.quervo#yahoo.com
5
Jim
Beam
jim.beam#protonmail.com
Table "rent"
inv_id
cust_id
rent_date
10
1
9/1/2022 10:29
11
1
9/2/2022 18:16
12
1
9/2/2022 18:17
13
1
9/17/2022 17:34
14
1
9/19/2022 6:32
15
1
9/19/2022 6:33
16
3
9/1/2022 18:45
17
3
9/1/2022 18:46
18
3
9/2/2022 18:45
19
3
9/2/2022 18:46
20
3
9/17/2022 18:32
21
3
9/19/2022 22:12
10
2
9/19/2022 11:43
11
2
9/19/2022 11:42
Table "inv"
mov_id
inv_id
22
10
23
11
24
12
25
13
26
14
27
15
28
16
29
17
30
18
31
19
31
20
32
21
Table "mov":
mov_id
titl
rate
22
Anaconda
3.99
23
Exorcist
1.99
24
Philadelphia
3.99
25
Quest
1.99
26
Sweden
1.99
27
Speed
1.99
28
Nemo
1.99
29
Zoolander
5.99
30
Truman
5.99
31
Patient
1.99
32
Racer
3.99
and here is my current query progress:
SELECT cust.email,
COUNT(DISTINCT inv.mov_id) AS "Rented_Count"
FROM cust
JOIN rent ON rent.cust_id = cust.cust_id
JOIN inv ON inv.inv_id = rent.inv_id
JOIN mov ON mov.mov_id = inv.mov_id
WHERE rent.rent_date BETWEEN '2022-09-01' AND '2022-09-31'
GROUP BY cust.email
ORDER BY "Rented_Count" DESC;
and here is what it outputs:
email
Rented_Count
jack.daniels#google.com
6
jim.beam#protonmail.com
6
jose.quervo#yahoo.com
2
and what I want it to be outputting:
email
jack.daniels#google.com
jim.beam#protonmail.com
From the results I am actually getting I have a tie for first place (Jim and Jack) and that is fine but I would like it to list both tieing email addresses not just Jack's so you cant do anything with rows or max I don't think.
I think it must have something to do with dense_rank but I don't know how to use that specifically in this scenario with the count and Group By?
Your creativity and help would be appreciated.

You're missing the FETCH FIRST ROWS WITH TIES clause. It will work together with the ORDER BY clause to get you the highest values (FIRST ROWS), including ties (WITH TIES).
SELECT cust.email
FROM cust
INNER JOIN rent
ON rent.cust_id = cust.cust_id
INNER JOIN inv
ON inv.inv_id = rent.inv_id
INNER JOIN mov
ON mov.mov_id = inv.mov_id
WHERE rent.rent_date BETWEEN '2022-09-01' AND '2022-09-31'
GROUP BY cust.email
ORDER BY COUNT(DISTINCT inv.mov_id) DESC
FETCH FIRST 1 ROWS WITH TIES

count number of records by month over the last five years where record date > select month

I need to show the number of valid inspectors we have by month over the last five years. Inspectors are considered valid when the expiration date on their certification has not yet passed, recorded as the month end date. The below SQL code is text of the query to count valid inspectors for January 2017:
SELECT Count(*) AS RecordCount
FROM dbo_Insp_Type
WHERE (dbo_Insp_Type.CERT_EXP_DTE)>=#2/1/2017#);
Rather than designing 60 queries, one for each month, and compiling the results in a final table (or, err, query) are there other methods I can use that call for less manual input?

From this sample:
Id
CERT_EXP_DTE
1
2022-01-15
2
2022-01-23
3
2022-02-01
4
2022-02-03
5
2022-05-01
6
2022-06-06
7
2022-06-07
8
2022-07-21
9
2022-02-20
10
2021-11-05
11
2021-12-01
12
2021-12-24
this single query:
SELECT
Format([CERT_EXP_DTE],"yyyy/mm") AS YearMonth,
Count(*) AS AllInspectors,
Sum(Abs([CERT_EXP_DTE] >= DateSerial(Year([CERT_EXP_DTE]), Month([CERT_EXP_DTE]), 2))) AS ValidInspectors
FROM
dbo_Insp_Type
GROUP BY
Format([CERT_EXP_DTE],"yyyy/mm");
will return:
YearMonth
AllInspectors
ValidInspectors
2021-11
1
1
2021-12
2
1
2022-01
2
2
2022-02
3
2
2022-05
1
0
2022-06
2
2
2022-07
1
1

ID
Cert_Iss_Dte
Cert_Exp_Dte
1
1/15/2020
1/15/2022
2
1/23/2020
1/23/2022
3
2/1/2020
2/1/2022
4
2/3/2020
2/3/2022
5
5/1/2020
5/1/2022
6
6/6/2020
6/6/2022
7
6/7/2020
6/7/2022
8
7/21/2020
7/21/2022
9
2/20/2020
2/20/2022
10
11/5/2021
11/5/2023
11
12/1/2021
12/1/2023
12
12/24/2021
12/24/2023
A UNION query could calculate a record for each of 50 months but since you want 60, UNION is out.
Or a query with 60 calculated fields using IIf() and Count() referencing a textbox on form for start date:
SELECT Count(IIf(CERT_EXP_DTE>=Forms!formname!tbxDate,1,Null)) AS Dt1,
Count(IIf(CERT_EXP_DTE>=DateAdd("m",1,Forms!formname!tbxDate),1,Null) AS Dt2,
...
FROM dbo_Insp_Type
Using the above data, following is output for Feb and Mar 2022. I did a test with Cert_Iss_Dte included in criteria and it did not make a difference for this sample data.
Dt1
Dt2
10
8
Or a report with 60 textboxes and each calls a DCount() expression with criteria same as used in query.
Or a VBA procedure that writes data to a 'temp' table.

SQL Temp table Array to perfrom rolling caluclations

I wish to use some sort of SQL array to subtract values from a certain row (QTYOnHand) that decreases that row value every time and throws it into a rolling calculation for the other rows. I've been thinking of some sort of Self Join/Temp Table solution, but not sure how to formulate. Also, All the results will be partitioned by the ItemID below. Help would be appreciated.
Here's some data, If I do a simple row by row subtraction I will get this: 17-3 = 14, 17-5 = 12 and so on.
(Item_ID) (ItemQty) (QTYOnHand) (QtyOnHand - ItemQty)
123 3 17 14
123 5 17 12
123 4 17 13
456 7 12 5
456 8 12 4
456 2 12 10
456 3 12 9
789 2 6 4
789 2 6 4
789 2 6 4
These are the results that I want, where I subtract every next value from the new QTYOnHand-ItemQty column value. Looks like 17-3 then 14 -5 then 9 -4 for Item_ID (123):
(Item_ID) (ItemQty) (QTYOnHand) (QtyOnHand - ItemQty)
123 3 17 14
123 5 17 9
123 4 17 5
456 7 12 5
456 8 12 -3
456 2 12 -5
456 3 12 -8
789 2 6 4
789 2 6 2
789 2 6 0

try the following:
;with cte as
(
select *, ROW_NUMBER() over (partition by Item_ID order by Item_ID) rn
from YourTable
)
, cte2 as
(
select Item_ID, ItemQty, QTYOnHand, Case when rn = 1 then QTYOnHand else 0 end - ItemQty as calc, rn
from cte
)
select Item_ID, ItemQty, QTYOnHand, sum(calc) over (partition by Item_ID order by rn) as [QtyOnHand - ItemQty]
from cte2 t1
Please find the db<>fiddle here.

How to write the query to make report by month in sql

I have the receiving and sending data for whole year. so i want to built the monthly report base on that data with the rule is Fisrt in first out. It means is the first receiving will be sent out first ...
DECLARE #ReceivingTbl AS TABLE(Id INT,ProId int, RecQty INT,ReceivingDate DateTime)
INSERT INTO #ReceivingTbl
VALUES (1,1001,210,'2019-03-12'),
(2,1001,315,'2019-06-15'),
(3,2001,500,'2019-04-01'),
(4,2001,10,'2019-06-15'),
(5,1001,105,'2019-07-10')
DECLARE #SendTbl AS TABLE(Id INT,ProId int, SentQty INT,SendMonth int)
INSERT INTO #SendTbl
VALUES (1,1001,50,3),
(2,1001,100,4),
(3,1001,80,5),
(4,1001,80,6),
(5,2001,200,6)
SELECT * FROM #ReceivingTbl ORDER BY ProId,ReceivingDate
SELECT * FROM #SendTbl ORDER BY ProId,SendMonth
Id ProId RecQty ReceivingDate
1 1001 210 2019-03-12
2 1001 315 2019-06-15
5 1001 105 2019-07-10
3 2001 500 2019-04-01
4 2001 10 2019-06-15
Id ProId SentQty SendMonth
1 1001 50 3
2 1001 100 4
3 1001 80 5
4 1001 80 6
5 2001 200 6
--- And the below is what i want:
Id ProId RecQty ReceivingDate ... Mar Apr May Jun
1 1001 210 2019-03-12 ... 50 100 60 0
2 1001 315 2019-06-15 ... 0 0 20 80
5 1001 105 2019-07-10 ... 0 0 0 0
3 2001 500 2019-04-01 ... 0 0 0 200
4 2001 10 2019-06-15 ... 0 0 0 0
Thanks!

Your question is not clear to me.
If you want to purely use the FIFO approach, therefore ignore any data the table contains, you necessarely need to order by ID, which in your example you are providing, and looks like it is in order of insert.
The first line inserted should be also the first line appearing in the select (FIFO), in order to do so you have to use:
ORDER BY Id ASC
Which will place the lower value of the ID first (1, 2, 3, ...)
To me though, this doesn't make much sense, so pay attention to the meaning o the data you actually have and leverage dates like ReceivingDate, and order by that, maybe even filtering by month of the date, below an example for January data:
WHERE MONTH(ReceivingDate) = 1

select last entry for every user multi table example

Given the following tables
table message
id message time_send
14 "first" 2014-02-10 22:16:31
15 "second" 2014-02-14 09:35:20
16 "third" 2014-02-13 09:35:47
17 "fourth" 2014-03-10 22:16:31
18 "fifth" 2014-03-14 09:35:20
19 "sixth" 2014-04-12 09:35:47
20 "seventh" 2014-04-13 09:35:47
21 "eighth" 2014-04-14 09:35:47
table message_owner
id message_id owner_id cp_id
1 14 1 4
2 14 4 1
3 15 12 4
4 15 4 12
5 16 4 1
6 16 1 4
7 17 12 4
8 17 4 12
9 18 4 1
10 18 1 4
11 19 12 4
12 19 4 12
13 20 12 1
14 20 1 12
15 21 12 7
16 21 7 12
I want to query the most recent message with every counter party(cp_id) for a given owner.
For example for owner_id=4 I would like the following output:
id message time_send owner_id cp_id
18 "fifth" 2014-03-14 09:35:20 4 1
19 "sixth" 2014-02-13 09:35:47 4 12
I see a lot of examples with one table but I am not able to transpose them in a multitable example.
edit1: adding more entries

This should work:
SELECT m.id, m.message, mo.owner_id, mo.cp_id
FROM message m
JOIN message_owner mo ON m.id=mo.message_id
WHERE mo.owner_id=4
AND m.time_send=(
SELECT MAX(time_send)
FROM message m2
JOIN message_owner mo2 ON mo2.message_id=m2.id
WHERE mo2.owner_id=mo.owner_id
AND mo2.cp_id =mo.cp_id
)
... notice though that putting a WHERE condition on a timestamp column can sometimes not work correctly.
Same example without jointures:
SELECT m.id, m.time_send, m.message, mo.owner_id, mo.cp_id
FROM message m, message_owner mo
WHERE m.id = mo.message_id
AND mo.owner_id = 4
AND m.time_send = (
SELECT MAX(time_send)
FROM message m2, message_owner mo2
WHERE mo2.message_id = m2.id
AND mo2.owner_id = mo.owner_id
AND mo2.cp_id = mo.cp_id
);
http://sqlfiddle.com/#!2/558d7/4

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Join SQL server tables without a unique identifier and duplicate data - sql

Related

How to get top values when there is a tie

count number of records by month over the last five years where record date > select month

SQL Temp table Array to perfrom rolling caluclations

How to write the query to make report by month in sql

select last entry for every user multi table example

Categories

Resources