retrieve several values previous to several given dates - sql

I got a values table such as:
id | user_id | value | date
---------------------------------
1 | 12 | 38 | 2014-04-05
2 | 15 | 19 | 2014-04-05
3 | 12 | 47 | 2014-04-08
I want to retrieve all values for given dates. However, if I don't have a value for one specific date, I want to get the previous available value. For instance, with the above dataset, if I query values for user 12 for dates 2014-04-07 and 2014-04-08, I need to retrieve 38 and 47.
I succeeded using two queries like:
SELECT *
FROM values
WHERE date <= $date
ORDER BY date DESC
LIMIT 1
However, that requires one query per date (dates.length requests each time). So I'm wondering if there is a more efficient solution that retrieves all my values in a single request?

In general, you would use a VALUES clause to specify multiple values in a single query.
If you have only occasional dates missing (and thus no big gaps in dates between rows for any particular user_id) then this would be an elegant solution:
SELECT dt, coalesce(value, lag(value) OVER (ORDER BY dt)) AS value
FROM (VALUES ('2014-04-07'::date), ('2014-04-08')) AS dates(dt)
LEFT JOIN "values" ON "date" = dt AND user_id = 12;
The lag() window function picks the previous value if the current row does not have a value.
If, on the other hand, there may be big gaps, you need to do some more work:
SELECT DISTINCT dt, first_value(value) OVER (PARTITION BY dt ORDER BY diff) AS value
FROM (
    SELECT dt, value, dt - "date" AS diff
    FROM (VALUES ('2014-04-07'::date), ('2014-04-08')) AS dates(dt)
    JOIN "values" ON "date" <= dt AND user_id = 12
) sub;
In this case the rows for user_id = 12 are joined to each date in the VALUES clause, keeping only rows on or before that date, and the difference between the queried date and each row's date is computed in a sub-query. So every queried date has at least one candidate value. In the main query the value with the smallest difference is then selected per queried date using the first_value() window function, partitioned by dt. Note that simply ordering the whole result on diff and picking the first row would not work here, because you want values for multiple dates returned.
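As a quick sanity check, here is a sketch of the gap-tolerant variant in Python with SQLite (the table is renamed to vals, since values is awkward as an identifier; julianday() stands in for Postgres date subtraction, and a per-date partition plus a "date" <= dt filter keep later values from winning for earlier queried dates):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vals (id INTEGER, user_id INTEGER, value INTEGER, date TEXT);
INSERT INTO vals VALUES (1, 12, 38, '2014-04-05'),
                        (2, 15, 19, '2014-04-05'),
                        (3, 12, 47, '2014-04-08');
""")
rows = conn.execute("""
SELECT DISTINCT dt,
       FIRST_VALUE(value) OVER (PARTITION BY dt ORDER BY diff) AS value
FROM (
    -- for each queried date, keep only rows on or before it
    SELECT dates.dt, vals.value,
           julianday(dates.dt) - julianday(vals.date) AS diff
    FROM (SELECT column1 AS dt FROM (VALUES ('2014-04-07'), ('2014-04-08'))) dates
    JOIN vals ON vals.user_id = 12 AND vals.date <= dates.dt
) sub
ORDER BY dt
""").fetchall()
print(rows)  # → [('2014-04-07', 38), ('2014-04-08', 47)]
```

For 2014-04-07 the only candidate on or before that date is the 2014-04-05 row (38); for 2014-04-08 the same-day row (47) has the smallest difference and wins.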

For each unique item in a redshift sql column, get the last rows based on a looking/scanning window

patient_id | alert_id | alert_timestamp
-----------|----------|----------------
3          | xyz      | 2022-10-10
1          | anp      | 2022-10-12
1          | gfe      | 2022-10-10
2          | fgy      | 2022-10-02
2          | gpl      | 2022-10-03
1          | gdf      | 2022-10-13
2          | mkd      | 2022-10-23
1          | liu      | 2022-10-01
I have a SQL table (see simplified version above) where for each patient_id, I want to keep only the latest alert (i.e. the last one) that was sent out in a given window period, e.g. window_size = 7.
Note, the window size needs to look at consecutive days, i.e. between day 1 -> day 1 + window_size. The range of alert_timestamp for each patient_id varies and is usually well beyond the window_size range.
Note that the table example above is a very simple one; the real table has many more patient_ids and is in mixed order in terms of alert_timestamp and alert_id.
The approach is to start from the last alert_timestamp for a given patient_id and work back, using the window_size to select the alert that was the last one in that window time frame.
Please note the idea is to have a scanning/looking window, e.g. window_size = 7 days, that moves across the timestamps of each patient.
The end result I want is a table with the alerts that remain after filtering.
Expected output for this example with window_size = 7:
patient_id | alert_id | alert_timestamp
-----------|----------|----------------
1          | liu      | 2022-10-01
1          | gdf      | 2022-10-13
2          | gpl      | 2022-10-03
2          | mkd      | 2022-10-23
3          | xyz      | 2022-10-10
What's the most efficient way to solve for this?
This can be done with the last_value window function but you need to prep your data a bit. Here's an example of what this could look like:
create table test (
patient_id int,
alert_id varchar(8),
alert_timestamp date);
insert into test values
(3, 'xyz', '2022-10-10'),
(1, 'anp', '2022-10-12'),
(1, 'gfe', '2022-10-10'),
(2, 'fgy', '2022-10-02'),
(2, 'gpl', '2022-10-03'),
(1, 'gdf', '2022-10-13'),
(2, 'mkd', '2022-10-23'),
(1, 'liu', '2022-10-01');
WITH RECURSIVE dates (dt) AS
(
    SELECT '2022-09-30'::DATE AS dt
    UNION ALL
    SELECT dt + 1
    FROM dates d
    WHERE dt < '2022-10-31'::DATE
),
p_dates AS
(
    SELECT pid, dt
    FROM dates d
    CROSS JOIN (SELECT DISTINCT patient_id AS pid FROM test) p
),
combined AS
(
    SELECT *
    FROM p_dates d
    LEFT JOIN test t
      ON d.dt = t.alert_timestamp
     AND d.pid = t.patient_id
),
latest AS
(
    SELECT patient_id,
           pid,
           alert_id,
           dt,
           alert_timestamp,
           LAST_VALUE(alert_id IGNORE NULLS)
               OVER (PARTITION BY pid ORDER BY dt
                     ROWS BETWEEN CURRENT ROW AND 7 FOLLOWING) AS at
    FROM combined
)
SELECT patient_id,
       alert_id,
       alert_timestamp
FROM latest
WHERE patient_id IS NOT NULL
  AND alert_id = at
ORDER BY patient_id,
         alert_timestamp;
This produces the results you are looking for with the test data, but there are a few assumptions. The big one is that there is at most one alert per patient per day. If that isn't true, some more data massaging will be needed. Either way this should give you an outline of how to do this.
The first need is to ensure that there is one row per patient per day, so that the window function can operate on rows, as these are then equivalent to days (for each patient). The date range is generated by a recursive CTE and joined to the test data to achieve the one row per day per patient.
The IGNORE NULLS option in the LAST_VALUE window function skips the "extra" rows created by the above process. The last step prunes out all the unneeded rows and ensures that only the latest alert of each window is produced.
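The same scanning-window rule can be checked outside the database with a short Python sketch (plain loops instead of window functions; keep_latest_in_window is a hypothetical helper name, not part of the SQL above):

```python
from datetime import date, timedelta

def keep_latest_in_window(alerts, window_size=7):
    """Keep an alert only if it is the latest alert for its patient
    within [its own day, its own day + window_size]."""
    by_patient = {}
    for pid, aid, ts in alerts:
        by_patient.setdefault(pid, []).append((ts, aid))
    kept = []
    for pid, rows in by_patient.items():
        rows.sort()  # ascending by timestamp
        for ts, aid in rows:
            # alerts falling inside this alert's forward-looking window
            in_window = [a for t, a in rows
                         if ts <= t <= ts + timedelta(days=window_size)]
            if in_window[-1] == aid:  # this alert is the last one in its window
                kept.append((pid, aid, ts))
    return sorted(kept)

alerts = [(3, 'xyz', date(2022, 10, 10)), (1, 'anp', date(2022, 10, 12)),
          (1, 'gfe', date(2022, 10, 10)), (2, 'fgy', date(2022, 10, 2)),
          (2, 'gpl', date(2022, 10, 3)), (1, 'gdf', date(2022, 10, 13)),
          (2, 'mkd', date(2022, 10, 23)), (1, 'liu', date(2022, 10, 1))]
print(keep_latest_in_window(alerts))
```

With the sample data this keeps exactly liu, gdf, gpl, mkd and xyz, matching the expected output (here sorted by patient_id and alert_id rather than by timestamp).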

The nearest row in the other table

One table is a sample of users and their purchases.
Structure:
Email | NAME | TRAN_DATETIME (Varchar)
So we have the customer email + FirstName&LastName + date of transaction,
and the second table, which comes from a second system, contains all users, their sensitive data and when they got registered in our system.
Simplified structure:
Email | InstertDate (varchar)
My task is to count the difference in minutes between the rows inserted from sales (first table) and the rows with users and their sensitive data.
The issue is that the second table contains many rows, and I want to find the row nearest in time that was inserted into the 2nd table, because sometimes it may be a few minutes' difference (a delay, or the opposite of a delay) and sometimes it can be a few days.
So for email x I have a row in the 1st table:
E_MAIL NAME TRAN_DATETIME
p****#****.eu xxx xxx 2021-10-04 00:03:09.0000000
But then I have 3 rows, and the latest is the one I want to compute the difference against:
Email InstertDate
p****#****.eu 2021-05-20 19:12:07
p****#****.eu 2021-05-20 19:18:48
p****#****.eu 2021-10-03 18:32:30 <--
I wrote some query, but I have no idea how to match the nearest row in the 2nd table:
SELECT DISTINCT TOP (100)
       a.[E_MAIL]
      ,a.[NAME]
      ,a.[TRAN_DATETIME]
      ,CASE WHEN b.EMAIL IS NOT NULL THEN 'YES' ELSE 'NO' END AS 'EXISTS'
      ,ABS(CONVERT(INT, CONVERT(Datetime, LEFT(a.[TRAN_DATETIME], 10), 120))
         - CONVERT(INT, CONVERT(Datetime, LEFT(b.[INSERTDATE], 10), 120))) AS 'DateAccuracy'
FROM [crm].[SalesSampleTable] a
LEFT JOIN [crm].[SensitiveTable] b ON a.[E_MAIL] = b.[EMAIL]
Totally untested: I'd need sample data and a database, and the area of suspicion is the casting of dates and the date math, since I don't know what RDBMS and version this is. So consider the following "pseudo code".
We assign a row number ordered by the absolute difference in seconds between the two dates; the rows with row number 1 win.
WITH CTE AS (
    SELECT a.*, b.*,
           row_number() OVER (PARTITION BY a.[E_MAIL]
               ORDER BY abs(datediff(second,
                                     cast(a.[TRAN_DATETIME] AS datetime),
                                     cast(b.[INSERTDATE] AS datetime)))) AS RN
    FROM [crm].[SalesSampleTable] a
    LEFT JOIN [crm].[SensitiveTable] b
      ON a.[E_MAIL] = b.[EMAIL]
)
SELECT * FROM CTE WHERE RN = 1
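If it helps to verify the matching rule outside SQL first, here is a tiny Python sketch of "nearest row in time" (nearest_insert is a hypothetical helper, not part of the answer's query):

```python
from datetime import datetime

def nearest_insert(tran_dt, insert_dts):
    # Pick the candidate timestamp with the smallest absolute
    # difference in seconds from the transaction timestamp.
    return min(insert_dts, key=lambda d: abs((d - tran_dt).total_seconds()))

tran = datetime(2021, 10, 4, 0, 3, 9)
candidates = [datetime(2021, 5, 20, 19, 12, 7),
              datetime(2021, 5, 20, 19, 18, 48),
              datetime(2021, 10, 3, 18, 32, 30)]
print(nearest_insert(tran, candidates))  # → 2021-10-03 18:32:30
```

This mirrors what the row_number() ordering does: the row minimizing the absolute difference gets RN = 1.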

How to identify rows per group before a certain value gap?

I'd like to update a certain column in a table based on the difference in another column's value between neighboring rows in PostgreSQL.
Here is a test setup:
CREATE TABLE test(
main INTEGER,
sub_id INTEGER,
value_t INTEGER);
INSERT INTO test (main, sub_id, value_t)
VALUES
(1,1,8),
(1,2,7),
(1,3,3),
(1,4,85),
(1,5,40),
(2,1,3),
(2,2,1),
(2,3,1),
(2,4,8),
(2,5,41);
My goal is to determine, in each group main, starting from sub_id 1, which value in diff first falls outside a certain band (e.g. diff > 10 or diff < -10), checking in ascending order of sub_id. Until that threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value, e.g. 1.
Should I use a loop or are there smarter solutions?
The task description in pseudocode:
FOR i in GROUP [PARTITION BY main ORDER BY sub_id]:
DO UNTIL diff > 10 OR diff < -10
SET newval = 1 AND LEAD(newval) = 1
Basic SELECT
As fast as possible:
SELECT *, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
SELECT *, value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
FROM test
) sub;
Fine points
Your thought model revolves around the window function lead(). But its counterpart lag() is a bit more efficient for the purpose, since there is no off-by-one error when including the row before the big gap. Alternatively, use lead() with inverted sort order (ORDER BY sub_id DESC).
To avoid NULL for the first row in the partition, provide value_t as the default in the 3rd parameter, which makes the diff 0 instead of NULL. Both lead() and lag() have that capability.
diff BETWEEN -10 AND 10 is slightly faster than @ diff < 11 (and clearer and more flexible, too). (@ being Postgres's "absolute value" operator, equivalent to the abs() function.)
bool_or() or bool_and() in the outer window function is probably cheapest to mark all rows up to the big gap.
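The same flagging idea can be reproduced in SQLite via Python, as a rough sketch: SQLite has no bool_and(), but since its booleans are 0/1 integers, a running MIN over the band check plays the same role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (main INTEGER, sub_id INTEGER, value_t INTEGER);
INSERT INTO test VALUES (1,1,8),(1,2,7),(1,3,3),(1,4,85),(1,5,40),
                        (2,1,3),(2,2,1),(2,3,1),(2,4,8),(2,5,41);
""")
rows = conn.execute("""
SELECT main, sub_id,
       -- running "AND": the min of a 0/1 flag stays 1 until the first 0
       MIN(diff BETWEEN -10 AND 10)
           OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
    SELECT *, value_t - LAG(value_t, 1, value_t)
                  OVER (PARTITION BY main ORDER BY sub_id) AS diff
    FROM test
) sub
ORDER BY main, sub_id
""").fetchall()
print(rows)
```

With the sample data the flag is 1 for main 1 up to sub_id 3 and for main 2 up to sub_id 4, dropping to 0 from the first row whose diff leaves the -10..10 band.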
Your UPDATE
Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value e.g. 1.
Again, as fast as possible.
UPDATE test AS t
SET newval = 1
FROM (
SELECT main, sub_id
, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
SELECT main, sub_id
, value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
FROM test
) sub
) u
WHERE (t.main, t.sub_id) = (u.main, u.sub_id)
AND u.flag;
Fine points
Computing all values in a single query is typically substantially faster than a correlated subquery.
The added WHERE condition AND u.flag makes sure we only update rows that actually need an update.
If some of the rows may already have the right value in newval, add another clause to avoid those empty updates, too: AND t.newval IS DISTINCT FROM 1
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
SET newval = 1 assigns a constant (even though we could use the actually calculated value in this case), that's a bit cheaper.
Your question was hard to comprehend, the "value_t" column was irrelevant to the question, and you forgot to define the "diff" column in your SQL.
Anyhow, here's your solution:
WITH data AS (
SELECT main, sub_id, value_t
, abs(value_t
- lead(value_t) OVER (PARTITION BY main ORDER BY sub_id)) > 10 is_evil
FROM test
)
SELECT main, sub_id, value_t
, CASE max(is_evil::int)
OVER (PARTITION BY main ORDER BY sub_id
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
WHEN 1 THEN NULL ELSE 1 END newval
FROM data;
I'm using a CTE to prepare the data (computing whether a row is "evil"), and then the "max" window function is used to check if there were any "evil" rows before the current one, per partition.
EXISTS on an aggregating subquery:
UPDATE test u
SET value_t = NULL
WHERE EXISTS (
SELECT * FROM (
SELECT main,sub_id
, value_t , ABS(value_t - lag(value_t)
OVER (PARTITION BY main ORDER BY sub_id) ) AS absdiff
FROM test
) x
WHERE x.main = u.main
AND x.sub_id <= u.sub_id
AND x.absdiff >= 10
)
;
SELECT * FROM test
ORDER BY main, sub_id;
Result:
UPDATE 3
main | sub_id | value_t
------+--------+---------
1 | 1 | 8
1 | 2 | 7
1 | 3 | 3
1 | 4 |
1 | 5 |
2 | 1 | 3
2 | 2 | 1
2 | 3 | 1
2 | 4 | 8
2 | 5 |
(10 rows)

Insert in table such that it does not create consecutive duplicates H2

I need to create an insert query for an H2 db. The insert must not create a consecutive duplicate in the table; if it would, the row should be ignored. For example,
First name | Last name | Date
A | Z | 2018-12-02
B | Y | 2018-12-03
A | X | 2018-12-04
If I have to insert the row | A | W | 2018-12-01 | into the above table (which is sorted in ascending order by date), the insert checks for a consecutive duplicate in the First name column. The row would land as shown below, creating a consecutive duplicate (two adjacent rows with First name A), and therefore it should be ignored:
First name | Last name | Date
A | W | 2018-12-01
A | Z | 2018-12-02
B | Y | 2018-12-03
A | X | 2018-12-04
In H2 you can use the following SQL:
INSERT INTO tableName SELECT * FROM VALUES ('A', 'W', DATE '2018-12-01') T(F, L, D)
WHERE NOT EXISTS (
SELECT * FROM tableName
QUALIFY "First name" = F
AND DENSE_RANK() OVER (ORDER BY "Date")
- DENSE_RANK(D) WITHIN GROUP (ORDER BY "Date") OVER() IN (-1, 0)
);
Here the window version of the hypothetical set DENSE_RANK function (not to be confused with the window DENSE_RANK function, which is a different one) is used to determine the insert position of a new row:
DENSE_RANK(D) WITHIN GROUP (ORDER BY "Date") OVER()
This aggregate function is part of the SQL Standard, but in the SQL Standard it may not be used as a window function; H2 is less restrictive.
Then the plain DENSE_RANK window function is used to number the existing rows in the table. The difference between the number of an existing row and the number of the hypothetical row is -1 for the row before the inserted row and 0 for the row after it.
We need to check only rows with the same "First name" value. So the whole filter criteria will be
"First name" = F
AND DENSE_RANK() OVER (ORDER BY "Date")
- DENSE_RANK(D) WITHIN GROUP (ORDER BY "Date") OVER() IN (-1, 0)
In the SQL Standard you can't filter results after evaluation of window functions, but H2 has non-standard QUALIFY clause (from Teradata) for that purpose, in other DBMS a subquery is needed.
The final condition to decide whether row may be inserted is
WHERE NOT EXISTS (
SELECT * FROM tableName
QUALIFY "First name" = F
AND DENSE_RANK() OVER (ORDER BY "Date")
- DENSE_RANK(D) WITHIN GROUP (ORDER BY "Date") OVER() IN (-1, 0)
);
If there are no such rows, insertion of the new row will not create two sequential rows with the same "First name" value.
This condition can be used in a plain standard insert-from-select command.
This solution is not expected to be fast if the table has many rows. A more efficient solution can use subqueries that look up only the previous and the next row, such as NOT EXISTS (SELECT * FROM ((SELECT * FROM tableName WHERE "Date" > D ORDER BY "Date" FETCH FIRST ROW ONLY) UNION (SELECT * FROM tableName WHERE "Date" < D ORDER BY "Date" DESC FETCH FIRST ROW ONLY)) WHERE "First name" = F), but H2 doesn't allow references to outer tables from deeply nested queries, so D here needs to be replaced with a JDBC parameter (… VALUES (?1, ?2, ?3) … WHERE "Date" < ?3 …). You can try to build such a command yourself.
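The underlying check (does the new row's First name collide with either of its would-be neighbors in date order?) is also easy to express on the application side. A Python sketch, with can_insert as a hypothetical helper:

```python
import bisect

def can_insert(rows, first_name, day):
    """rows: list of (date_string, first_name) sorted ascending by date.
    The insert is allowed only if neither the row just before nor the row
    just after the new date has the same First name."""
    dates = [d for d, _ in rows]
    i = bisect.bisect_left(dates, day)
    prev_ok = i == 0 or rows[i - 1][1] != first_name
    next_ok = i == len(rows) or rows[i][1] != first_name
    return prev_ok and next_ok

rows = [('2018-12-02', 'A'), ('2018-12-03', 'B'), ('2018-12-04', 'A')]
print(can_insert(rows, 'A', '2018-12-01'))  # next neighbor is also 'A' → False
print(can_insert(rows, 'C', '2018-12-01'))  # no collision → True
```

This is the same neighbor test the DENSE_RANK difference expresses, just done with a binary search instead of window functions.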

How to write Oracle query to find a total length of possible overlapping from-to dates

I'm struggling to find the query for the following task
I have the following data and want to find the total network day for each unique ID
ID From To NetworkDay
1 03-Sep-12 07-Sep-12 5
1 03-Sep-12 04-Sep-12 2
1 05-Sep-12 06-Sep-12 2
1 06-Sep-12 12-Sep-12 5
1 31-Aug-12 04-Sep-12 3
2 04-Sep-12 06-Sep-12 3
2 11-Sep-12 13-Sep-12 3
2 05-Sep-12 08-Sep-12 3
Problem is the date range can be overlapping and I can't come up with SQL that will give me the following results
ID From To NetworkDay
1 31-Aug-12 12-Sep-12 9
2 04-Sep-12 08-Sep-12 4
2 11-Sep-12 13-Sep-12 3
and then
ID Total Network Day
1 9
2 7
In case the network-day calculation is not possible, just getting to the second table would be sufficient.
Hope my question is clear
We can use Oracle analytic functions, namely the "OVER ... PARTITION BY" clause, to do this. The PARTITION BY clause is kind of like a GROUP BY, but without the aggregation part. That means we can group rows together (i.e. partition them) and then perform an operation on them as separate groups. As we operate on each row we can also access the columns of the previous row. This is the feature PARTITION BY gives us. (PARTITION BY is not related to partitioning a table for performance.)
So then how do we output the non-overlapping dates? We first order the query on the (ID, DFROM) fields, then use the ID field to make our partitions (row groups). We then test the previous row's TO value and the current row's FROM value for overlap using an expression like (in pseudo code):
max(previous.DTO, current.DFROM) as DFROM
This basic expression returns the original DFROM value if it doesn't overlap, but the previous TO value if there is overlap. Since our rows are ordered we only need to be concerned with the last row. In cases where a previous row completely overlaps the current row we want the row to end up with a "zero" date range. So we do the same thing for the DTO field to get:
max(previous.DTO, current.DFROM) as DFROM, max(previous.DTO, current.DTO) as DTO
Once we have generated the new results set with the adjusted DFROM and DTO values, we can aggregate them up and count the range intervals of DFROM and DTO.
Be aware that most date calculations in databases are not inclusive, unlike your data. Something like DATEDIFF(dto, dfrom) will not include the day dto actually refers to, so we will want to adjust dto up a day first.
I don't have access to an Oracle server anymore, but I know this is possible with Oracle analytic functions. The query should go something like this:
(Please update my post if you get this to work.)
SELECT id,
       GREATEST(dfrom, NVL(LAG(dto) OVER (PARTITION BY id ORDER BY dfrom), dfrom)) AS dfrom,
       GREATEST(dto,   NVL(LAG(dto) OVER (PARTITION BY id ORDER BY dfrom), dto))   AS dto
FROM (
    SELECT id, dfrom, dto + 1 AS dto  -- adjust the table so that dto becomes non-inclusive
    FROM my_sample
    ORDER BY id, dfrom
) sample;
The secret here is the LAG(dto) OVER (PARTITION BY id ORDER BY dfrom) expression, which returns the DTO value of the previous row in the partition. (GREATEST and NVL replace the pseudo-code max(), since Oracle's MAX is an aggregate, and GREATEST returns NULL when any argument is NULL, as on the first row of each partition.)
So this query should output new dfrom/dto values which don't overlap. It's then a simple matter of sub-querying this, computing (dto - dfrom), and summing the totals.
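The merge-then-count plan is easy to prototype outside the database. A Python sketch (merge_intervals and network_days are hypothetical helper names), merging on overlap of inclusive date ranges and counting Mon-Fri as network days:

```python
from datetime import date, timedelta

def merge_intervals(ranges):
    # ranges: (start, end) pairs with inclusive dates, possibly overlapping
    merged = []
    for s, e in sorted(ranges):
        if merged and s <= merged[-1][1]:  # overlaps the previous interval
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def network_days(s, e):
    # count Mon-Fri days in the inclusive range [s, e]
    return sum(1 for i in range((e - s).days + 1)
               if (s + timedelta(days=i)).weekday() < 5)

data = {
    1: [(date(2012, 9, 3), date(2012, 9, 7)), (date(2012, 9, 3), date(2012, 9, 4)),
        (date(2012, 9, 5), date(2012, 9, 6)), (date(2012, 9, 6), date(2012, 9, 12)),
        (date(2012, 8, 31), date(2012, 9, 4))],
    2: [(date(2012, 9, 4), date(2012, 9, 6)), (date(2012, 9, 11), date(2012, 9, 13)),
        (date(2012, 9, 5), date(2012, 9, 8))],
}
totals = {i: sum(network_days(s, e) for s, e in merge_intervals(r))
          for i, r in data.items()}
print(totals)  # → {1: 9, 2: 7}
```

This reproduces the totals from the question (9 for ID 1, 7 for ID 2) and can serve as a reference while debugging the SQL versions.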
Using MySQL
I did have access to a MySQL server, so I did get it working there. MySQL doesn't have result-set partitioning (analytic functions) like Oracle, so we have to use result-set variables. This means we use @var := xxx type expressions to remember the last date value and adjust dfrom/dto accordingly. Same algorithm, just a little longer and more complex syntax. We also have to forget the last date value any time the ID field changes!
So here is the sample table (same values you have):
create table sample(id int, dfrom date, dto date, networkDay int);
insert into sample values
(1,'2012-09-03','2012-09-07',5),
(1,'2012-09-03','2012-09-04',2),
(1,'2012-09-05','2012-09-06',2),
(1,'2012-09-06','2012-09-12',5),
(1,'2012-08-31','2012-09-04',3),
(2,'2012-09-04','2012-09-06',3),
(2,'2012-09-11','2012-09-13',3),
(2,'2012-09-05','2012-09-08',3);
On to the query, we output the un-grouped result set like above:
The variable @ldt is "last date", and the variable @lid is "last id". Any time @lid changes, we reset @ldt to null. FYI, in MySQL the := operator is where the assignment happens; a plain = operator is just an equality test.
This is a 3-level query, but it could be reduced to 2. I went with an extra outer query to keep things more readable. The innermost query is simple: it adjusts the dto column to be non-inclusive and does the proper row ordering. The middle query adjusts the dfrom/dto values to make them non-overlapping. The outer query simply drops the unused fields and calculates the interval range.
set @ldt=null, @lid=null;
select id, no_dfrom as dfrom, no_dto as dto, datediff(no_dto, no_dfrom) as days from (
    select if(@lid=id,@ldt,@ldt:=null) as last, dfrom, dto,
           if(@ldt>=dfrom,@ldt,dfrom) as no_dfrom,
           if(@ldt>=dto,@ldt,dto) as no_dto,
           @ldt:=if(@ldt>=dto,@ldt,dto),
           @lid:=id as id,
           datediff(dto, dfrom) as overlapped_days
    from (select id, dfrom, dto + INTERVAL 1 DAY as dto from sample order by id, dfrom) as sample
) as nonoverlapped
order by id, dfrom;
The above query gives the results (notice dfrom/dto are non-overlapping here):
+------+------------+------------+------+
| id | dfrom | dto | days |
+------+------------+------------+------+
| 1 | 2012-08-31 | 2012-09-05 | 5 |
| 1 | 2012-09-05 | 2012-09-08 | 3 |
| 1 | 2012-09-08 | 2012-09-08 | 0 |
| 1 | 2012-09-08 | 2012-09-08 | 0 |
| 1 | 2012-09-08 | 2012-09-13 | 5 |
| 2 | 2012-09-04 | 2012-09-07 | 3 |
| 2 | 2012-09-07 | 2012-09-09 | 2 |
| 2 | 2012-09-11 | 2012-09-14 | 3 |
+------+------------+------------+------+
How about constructing SQL which merges intervals by removing holes and considering only maximal intervals? It goes like this (not tested):
SELECT DISTINCT F.ID, F.From, L.To
FROM Temp AS F, Temp AS L
WHERE F.From < L.To AND F.ID = L.ID
AND NOT EXISTS (SELECT *
FROM Temp AS T
WHERE T.ID = F.ID
AND F.From < T.From AND T.From < L.To
AND NOT EXISTS ( SELECT *
FROM Temp AS T1
WHERE T1.ID = F.ID
AND T1.From < T.From
AND T.From <= T1.To)
)
AND NOT EXISTS (SELECT *
FROM Temp AS T2
WHERE T2.ID = F.ID
AND (
(T2.From < F.From AND F.From <= T2.To)
OR (T2.From < L.To AND L.To < T2.To)
)
)
with t_data as (
select 1 as id,
to_date('03-sep-12','dd-mon-yy') as start_date,
to_date('07-sep-12','dd-mon-yy') as end_date from dual
union all
select 1,
to_date('03-sep-12','dd-mon-yy'),
to_date('04-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('05-sep-12','dd-mon-yy'),
to_date('06-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('06-sep-12','dd-mon-yy'),
to_date('12-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('31-aug-12','dd-mon-yy'),
to_date('04-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('04-sep-12','dd-mon-yy'),
to_date('06-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('11-sep-12','dd-mon-yy'),
to_date('13-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('05-sep-12','dd-mon-yy'),
to_date('08-sep-12','dd-mon-yy') from dual
),
t_holidays as (
select to_date('01-jan-12','dd-mon-yy') as holiday
from dual
),
t_data_rn as (
select rownum as rn, t_data.* from t_data
),
t_model as (
select distinct id,
start_date
from t_data_rn
model
partition by (rn, id)
dimension by (0 as i)
measures(start_date, end_date)
rules (
    start_date[for i from 1 to end_date[0] - start_date[0] increment 1]
        = start_date[0] + cv(i),
    end_date[any] = start_date[cv()] + 1
)
order by 1,2
),
t_network_days as (
select t_model.*,
case when
mod(to_char(start_date, 'j'), 7) + 1 in (6, 7)
or t_holidays.holiday is not null
then 0 else 1
end as working_day
from t_model
left outer join t_holidays
on t_holidays.holiday = t_model.start_date
)
select id,
sum(working_day) as network_days
from t_network_days
group by id;
t_data - your initial data
t_holidays - contains list of holidays
t_data_rn - just adds unique key (rownum) to each row of t_data
t_model - expands t_data date ranges into a flat list of dates
t_network_days - marks each date from t_model as a working day or weekend, based on the day of week (Sat and Sun) and the holidays list
final query - calculates the number of network days per group.