I'm asking for your help after several unsuccessful attempts.
I am learning PL/SQL and I am using Oracle SQL Developer v20.
Here is my situation. My data set looks like this:
id_file   size_byte   created_at
_______   _________   ____________________________
      1       45323   17-FEB-22 17:21:13,726874000
      2       41232   17-FEB-22 17:21:13,740587004
      3     1234456   20-FEB-22 17:25:13,368874058
      4   233545488   20-FEB-22 17:21:18,400049000
      5   233545488   21-FEB-22 18:11:18,058746868
So my desired output would be something like this for year 2022:
TOT_records   AVG_file_created_for_day   TOT_size_files   AVG_size_files_created_each_day
___________   ________________________   ______________   _______________________________
  9.999.999                     10.000      999.999.999   5 MB (default is byte)
ID is of type NUMBER, SIZE_BYTE is of type NUMBER, and CREATED_AT is TIMESTAMP(6).
My table is partitioned by year; the partition key PARTITION_DATE is of type DATE.
There's some ambiguity in phrases like "average file size per day". It could mean either:
the sum of all file sizes divided by the total number of days, or
the average file size within each day, and then the average of those daily averages.
Anyway, here's something to get you going (I'm assuming the latter interpretation; a sketch for the former follows the query results below).
SQL> create table t as
2 select
3 rownum id_file,
4 dbms_random.value(1000,20000000) bytes,
5 date '2021-01-01' + dbms_random.value(1,700) created_at
6 from dual
7 connect by level <= 5000;
Table created.
SQL>
SQL> select * from t
2 where rownum <= 20;
ID_FILE BYTES CREATED_A
---------- ---------- ---------
1 19305636.7 02-SEP-22
2 6305773.83 10-OCT-21
3 11939117.8 04-NOV-21
4 11039507.9 01-SEP-21
5 15555516.8 02-NOV-22
6 2809048.47 13-SEP-22
7 2070381.41 18-DEC-21
8 11116786.1 11-MAR-22
9 17519679.8 21-DEC-21
10 6728222.84 02-APR-22
11 7569442.31 07-AUG-22
12 16949454.2 06-JUL-21
13 8019443.02 03-JUN-21
14 13147674.9 31-AUG-21
15 14590702.5 16-JUL-22
16 13028609.7 11-MAY-21
17 5466477.07 06-APR-22
18 4469902.12 08-MAY-21
19 14511096 31-MAY-22
20 5245726.03 12-JUL-21
20 rows selected.
SQL> select
2 count(*) total_records,
3 avg(daily_size_avg)/1024/1024 avg_size_files_per_day_mb,
4 sum(bytes)/1024/1024/1024 tot_bytes_gb,
5 avg(files_per_day) avg_files_per_day
6 from
7 (
8 select
9 bytes,
10 avg(bytes) over ( partition by trunc(created_at) ) daily_size_avg,
11 count(*) over ( partition by trunc(created_at) ) files_per_day
12 from t
13 );
TOTAL_RECORDS AVG_SIZE_FILES_PER_DAY_MB TOT_BYTES_GB AVG_FILES_PER_DAY
------------- ------------------------- ------------ -----------------
5000 9.5313187 46.5396421 8.092
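If you wanted the former interpretation instead (total bytes divided by the number of distinct days on which files were created), restricted to one year, a minimal sketch against the same test table t might look like the following. The output column names are just illustrative; if your real table is partitioned by PARTITION_DATE you may prefer to filter on that column instead:

select
    count(*)                                                       total_records,
    sum(bytes)/1024/1024/1024                                      tot_bytes_gb,
    count(*) / count(distinct trunc(created_at))                   avg_files_per_day,
    sum(bytes) / count(distinct trunc(created_at)) / 1024 / 1024   avg_size_per_day_mb
from t
where created_at >= date '2022-01-01'
  and created_at <  date '2023-01-01';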
I'm relatively inexperienced in SQL and could use some help beyond the usual SELECT and JOIN.
The Problem
Suppose you have two tables you wish to join in Microsoft SQL Server, but they are missing a unique identifier, so duplicate entries are incorrectly generated. I've created an example SQL Fiddle with a small subset of the full database schema to demonstrate: http://sqlfiddle.com/#!18/df3fc.
One table holds a list of measurement steps taken for two systems, identified by their serial. Each measurement step can produce multiple pieces of data, which are stored in the second table. This would not normally be an issue, but, as in the SQL Fiddle example for serial=1004, sometimes the same data is retaken as part of a rework. When I then query, each piece of rework data gets joined to each step, duplicating data. For example, for serial 1004 and measurementid 8 the steps table has two rows (ids 1 and 11) and the data table has ten rows, so the join below returns 2 × 10 = 20 rows instead of the 10 I want. The select query:
SELECT my_measurement_steps.id AS steps_id,
       my_measurement_steps.serial,
       my_measurement_data.id AS data_id,
       my_measurement_data.my_data,
       my_measurement_data.measurementid,
       my_measurement_steps.date
FROM my_measurement_steps
INNER JOIN my_measurement_data
    ON my_measurement_steps.serial = my_measurement_data.serial
   AND my_measurement_steps.measurementid = my_measurement_data.measurementid
Desired Output
steps_id   serial   data_id   my_data         measurementid   date
--------   ------   -------   -------------   -------------   -----------------------
15         1004     36        0.9496555       33              2021-10-12 07:55:58.100
14         1004     35        -0.03252285     11              2021-10-07 07:56:31.530
14         1004     34        -0.0003081787   11              2021-10-07 07:56:31.530
13         1004     33        -0.01728721     10              2021-10-07 07:56:31.530
13         1004     32        -0.1996608      10              2021-10-07 07:56:31.530
12         1004     31        0.003044653     9               2021-10-07 07:24:49.500
12         1004     30        0.002392432     9               2021-10-07 07:24:49.500
11         1004     29        1.012242        8               2021-10-07 07:24:30.720
11         1004     28        1.003897        8               2021-10-07 07:24:30.720
11         1004     27        0.9917302       8               2021-10-07 07:24:30.720
11         1004     26        -0.002975781    8               2021-10-07 07:24:30.720
11         1004     25        -0.002746948    8               2021-10-07 07:24:30.720
10         1004     24        0.9695401       33              2021-10-05 11:37:51.430
9          1005     23        0.9731983       33              2021-10-05 08:00:10.490
8          1005     22        0.01013499      11              2021-10-01 07:12:07.470
8          1005     21        -0.007311231    11              2021-10-01 07:12:07.470
7          1005     20        -0.0003634033   10              2021-10-01 07:12:07.470
7          1005     19        -0.2021408      10              2021-10-01 07:12:07.470
6          1005     18        -0.002507007    9               2021-09-30 13:00:57.260
6          1005     17        0.001181299     9               2021-09-30 13:00:57.260
5          1005     16        1.007857        8               2021-09-30 12:39:50.280
5          1005     15        1.000333        8               2021-09-30 12:39:50.280
5          1005     14        0.9913442       8               2021-09-30 12:39:50.280
5          1005     13        0.002449243     8               2021-09-30 12:39:50.280
5          1005     12        -0.002550488    8               2021-09-30 12:39:50.280
4          1004     11        -0.02970417     11              2021-09-30 06:57:33.160
4          1004     10        -0.0007542603   11              2021-09-30 06:57:33.160
3          1004     9         -0.005267761    10              2021-09-30 06:57:33.160
3          1004     8         -0.2038888      10              2021-09-30 06:57:33.160
2          1004     7         -0.007525305    9               2021-09-30 06:56:59.060
2          1004     6         -0.004998779    9               2021-09-30 06:56:59.060
1          1004     5         0.9935537       8               2021-09-29 12:34:08.090
1          1004     4         0.9952038       8               2021-09-29 12:34:08.090
1          1004     3         0.9978707       8               2021-09-29 12:34:08.090
1          1004     2         -0.0006630127   8               2021-09-29 12:34:08.090
1          1004     1         0.0002386719    8               2021-09-29 12:34:08.090
I'm unsure how to achieve the desired output given the repeating data. Also, for some serials there can be more than one repeat, as shown in the example.
Happy to provide any extra information required.
Many Thanks.
Code to Generate Tables
create table my_measurement_steps(id int, serial int, measurementid int, date datetime);
create table my_measurement_data(id int, serial int, my_data float(7), measurementid int);
insert into my_measurement_steps values
(1,1004,8,'2021-09-29 12:34:08.090'),
(2,1004,9,'2021-09-30 06:56:59.060'),
(3,1004,10,'2021-09-30 06:57:33.160'),
(4,1004,11,'2021-09-30 06:57:33.160'),
(5,1005,8,'2021-09-30 12:39:50.280'),
(6,1005,9,'2021-09-30 13:00:57.260'),
(7,1005,10,'2021-10-01 07:12:07.470'),
(8,1005,11,'2021-10-01 07:12:07.470'),
(9,1004,33,'2021-10-05 08:00:10.490'),
(10,1005,33,'2021-10-05 11:37:51.430'),
(11,1004,8,'2021-10-07 07:24:30.720'),
(12,1004,9,'2021-10-07 07:24:49.500'),
(13,1004,10,'2021-10-07 07:56:31.530'),
(14,1004,11,'2021-10-07 07:56:31.530'),
(15,1004,33,'2021-10-12 07:55:58.100');
insert into my_measurement_data values
(1,1004,0.0002386719,8),
(2,1004,-0.0006630127,8),
(3,1004,0.9978707,8),
(4,1004,0.9952038,8),
(5,1004,0.9935537,8),
(6,1004,-0.004998779,9),
(7,1004,-0.007525305,9),
(8,1004,-0.2038888,10),
(9,1004,-0.005267761,10),
(10,1004,-0.0007542603,11),
(11,1004,-0.02970417,11),
(12,1005,-0.002550488,8),
(13,1005,0.002449243,8),
(14,1005,0.9913442,8),
(15,1005,1.000333,8),
(16,1005,1.007857,8),
(17,1005,0.001181299,9),
(18,1005,-0.002507007,9),
(19,1005,-0.2021408,10),
(20,1005,-0.0003634033,10),
(21,1005,-0.007311231,11),
(22,1005,0.01013499,11),
(23,1004,0.9695401,33),
(24,1005,0.9731983,33),
(25,1004,-0.002746948,8),
(26,1004,-0.002975781,8),
(27,1004,0.9917302,8),
(28,1004,1.003897,8),
(29,1004,1.012242,8),
(30,1004,0.002392432,9),
(31,1004,0.003044653,9),
(32,1004,-0.1996608,10),
(33,1004,-0.01728721,10),
(34,1004,-0.0003081787,11),
(35,1004,-0.03252285,11),
(36,1004,0.9496555,33);
Edits
Added a datestamp to the measurement steps table - the SQL Fiddle wasn't working, so I couldn't update it.
All tables and the SQL Fiddle are now updated.
Removed a section and added the desired output.
You want to detect blocks of rows belonging together.
When sorting my_measurement_steps we see, for instance, that serial/measurementid 1004/8 occurs twice: once in row #1 and again in row #11.
When sorting my_measurement_data we see much the same thing: serial/measurementid 1004/8 occurs in two blocks, once in rows #1-5 and again in rows #25-29.
You want to join the serial/measurementid's nth occurrence in my_measurement_steps with its nth occurrence in my_measurement_data.
Detecting such blocks is a gaps-and-islands problem, which can be solved with two concurrent row counts: the difference between an overall row number and a per-serial/measurementid row number stays constant within a block and changes when a new block starts (for 1004/8 it is 0 for data rows #1-5 and 19 for rows #25-29).
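If you want to inspect the island detection on its own before building the full query, a quick sketch like this (the column aliases are just illustrative) shows that the difference of the two row counts is constant within each block:

select
    id, serial, measurementid,
    row_number() over (order by id) as rn_all,
    row_number() over (partition by serial, measurementid order by id) as rn_per_group,
    row_number() over (order by id) -
    row_number() over (partition by serial, measurementid order by id) as grp
from my_measurement_data
order by serial, measurementid, id;

The full query then looks like this: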
with data_groups_found as
(
  select
    my_measurement_data.*,
    -- difference of the two row counts: constant within an island,
    -- changes whenever a new island (rework block) starts
    row_number() over (order by id) -
    row_number() over (partition by serial, measurementid order by id) as grp
  from my_measurement_data
)
, data_groups_numbered as
(
  select
    data_groups_found.*,
    -- 1 for the first occurrence of a serial/measurementid block, 2 for the rework, ...
    dense_rank() over (partition by serial, measurementid order by grp) as grp_id
  from data_groups_found
)
, steps_numbered as
(
  select
    my_measurement_steps.*,
    -- nth step row for the same serial/measurementid
    row_number() over (partition by serial, measurementid order by id) as grp_id
  from my_measurement_steps
)
select *
from steps_numbered s
left join data_groups_numbered d
  on  d.serial        = s.serial
  and d.measurementid = s.measurementid
  and d.grp_id        = s.grp_id
order by s.id, d.id;
Demo: http://sqlfiddle.com/#!18/df3fc/6
I have a table with three columns named cid, orderdate, and priororderdate among others.
Here is how the table looks:
cid   orderdate          priororderdate     position
---   ----------------   ----------------   --------
12    NULL               NULL               1
12    NULL               NULL               2
12    NULL               NULL               3
12    2014-08-08 23:25   NULL               1
12    2014-08-08 23:25   NULL               2
12    2014-08-08 23:25   NULL               3
12    2014-08-08 23:25   NULL               4
12    2014-09-06 17:19   2014-08-08 23:25   1
12    2014-09-06 17:19   2014-08-08 23:25   2
12    2014-09-06 17:19   2014-08-08 23:25   3
13    NULL               NULL               1
13    NULL               NULL               2
13    NULL               NULL               3
The combination of the columns cid, orderdate, and priororderdate defines a unique fpid (a new column I want to create). Hence, the final result would be:
cid   orderdate          priororderdate     position   fpid
---   ----------------   ----------------   --------   ----
12    NULL               NULL               1          1
12    NULL               NULL               2          1
12    NULL               NULL               3          1
12    2014-08-08 23:25   NULL               1          2
12    2014-08-08 23:25   NULL               2          2
12    2014-08-08 23:25   NULL               3          2
12    2014-08-08 23:25   NULL               4          2
12    2014-09-06 17:19   2014-08-08 23:25   1          3
12    2014-09-06 17:19   2014-08-08 23:25   2          3
12    2014-09-06 17:19   2014-08-08 23:25   3          3
13    NULL               NULL               1          4
13    NULL               NULL               2          4
13    NULL               NULL               3          4
How can I create the fpid column?
You can do this using dense_rank() in a select query:
select t.*,
dense_rank() over (order by cid, orderdate, priororderdate) as fpid
from table t;
If you have the column fpid already in the table and want to update it:
with toupdate as (
select t.*,
dense_rank() over (order by cid, orderdate, priororderdate) as new_fpid
from table t
)
update toupdate
set fpid = new_fpid;
(If you want to add it, you can use an alter table statement.)
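For example, a minimal sketch, assuming the table is simply called t (adjust the name to your schema; GO is the SSMS/sqlcmd batch separator, used here so the new column exists before the update is compiled):

-- hypothetical table name "t"; replace with your actual table
alter table t add fpid int;
go
with toupdate as (
    select t.*,
           dense_rank() over (order by cid, orderdate, priororderdate) as new_fpid
    from t
)
update toupdate
set fpid = new_fpid;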
It's a little bit confusing that you say that fpid is unique, but looking at your desired output, it looks like you want to use ROW_NUMBER().
UPDATE t SET fpid =
    (SELECT ROW_NUMBER() OVER (ORDER BY cid)
     FROM tab2
     WHERE t.cid = cid
       AND t.orderdate = orderdate
       AND t.priororderdate = priororderdate
     GROUP BY cid, orderdate, priororderdate)
FROM tab2 t