Retrieve unknown values based on non-schema specific sequence of column values

Retrieve unknown values based on non-schema specific sequence of column values - sql

I want to return and operate on time values based on their related event values, but only if a specific sequence of events occurs. A simplified example table below:
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+
| id | event1 | time1 | event2 | time2 | event3 | time3 | event4 | time4 | event5 | time5 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+
| abc123 | firstevent | 10:00 | secondevent | 10:01 | thirdevent | 10:02 | fourthevent | 10:03 | fifthevent | 10:04 |
| abc123 | thirdevent | 10:10 | secondevent | 10:11 | thirdevent | 10:12 | firstevent | 10:13 | secondevent | 10:14 |
| def456 | thirdevent | 10:20 | firstevent | 10:21 | secondevent | 10:22 | thirdevent | 10:24 | fifthevent | 10:25 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+
For this table we want to retrieve the times whenever this particular sequence of events occurs: firstevent, secondevent, thirdevent, and a final event of any non-zero value. Meaning the relevant entries returned would be the following:
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+------------+-------+
| id | event1 | time1 | event2 | time2 | event3 | time3 | event4 | time4 | event5 | time5 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+------------+-------+
| abc123 | firstevent | 10:00 | secondevent | 10:01 | thirdevent | 10:02 | fourthevent | 10:03 | null | null |
| null | null | null | null | null | null | null | null | null | null | null |
| def456 | null | null | firstevent | 10:21 | secondevent | 10:22 | thirdevent | 10:24 | fifthevent | 10:26 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+------------+-------+
As shown above the columns are irrelevant to the occurrence of the sequence, with two results being returned starting in both the event1 and event2 columns, thus the solution should be independent and support n number of columns. These values can then be aggregated by the final non-zero event that occurs in the sequence after the 3 fixed variables to give something like the following:
+-------------+-------------------------------+
| FinalEvent | AverageTimeBetweenFinalEvents |
+-------------+-------------------------------+
| fourthevent | 1:00 |
| fifthevent | 2:00 |
+-------------+-------------------------------+

Below is for BigQuery Standard SQL
#standardSQL
WITH search_events AS (
SELECT ['firstevent', 'secondevent', 'thirdevent'] search
), temp AS (
SELECT *, REGEXP_EXTRACT(events, CONCAT(search, r',(\w*)')) FinalEvent
FROM (
SELECT id, [time1, time2, time3, time4, time5] times,
(SELECT STRING_AGG(event) FROM UNNEST([event1, event2, event3, event4, event5]) event) events,
(SELECT STRING_AGG(search) FROM UNNEST(search) search) search
FROM `project.dataset.table`, search_events
)
)
SELECT FinalEvent,
times[SAFE_OFFSET(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(REGEXP_EXTRACT(events, CONCAT(r'(.*?)', search, ',', FinalEvent )), ',')) + 3)] time
FROM temp
WHERE IFNULL(FinalEvent, '') != ''
If to apply to sample data from your question - result is
Row FinalEvent time
1 fourthevent 10:03
2 fifthevent 10:25
So, as you can see - all final events are extracted along with their respective times
Now, you can do here whatever analytics you need - I was not sure about logic behind AverageTimeBetweenFinalEvents, so I am leaving this to you - especially that I think that main focus of the question was extraction of those final events
would you be able to provide the logic behind this statement please?
times[SAFE_OFFSET(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(REGEXP_EXTRACT(events, CONCAT(r'(.*?)', search, ',', FinalEvent )), ',')) + 3)] time
Sure, hope below helps to get a logic behind that expression
assemble regular expression to extract list of events happened before matched events
extract those events
extract all commas into array
calculate position of final event by taking number of commas in above array + 3 (three is to reflect number of positions in search sequence)
extract respective time as an element of times array

Related

How to split these multiple rows in SQL?

I am currently studying SQL and I am still a newbie. I have this task where I need to split some rows with various entries like dates and user IDs. I really need help
+-------+------------------------------+---------------------------+
| TYPE | DATES | USER _ID |
+-------+------------------------------+---------------------------+
| WORK | ["2022-06-02", "2022-06-03"] | {74042,88357,83902,88348} |
| LEAVE | ["2022-05-16", "2022-05-26"] | {83902,74042,88357,88348} |
+-------+------------------------------+---------------------------+
the end result should look like this. the user id's should be aligned or should be in the same as their respective dates.
+-------+------------+---------+
| TYPE | DATES | USER_ID |
+-------+------------+---------+
| LEAVE | 05/16/2022 | 74042 |
| LEAVE | 05/16/2022 | 88357 |
| LEAVE | 05/16/2022 | 88348 |
| LEAVE | 05/16/2022 | 83902 |
| LEAVE | 05/26/2022 | 74042 |
| LEAVE | 05/26/2022 | 88357 |
| LEAVE | 05/26/2022 | 88348 |
| LEAVE | 05/26/2022 | 83902 |
| WORK | 06/2/2022 | 74042 |
| WORK | 06/2/2022 | 88357 |
| WORK | 06/2/2022 | 88348 |
| WORK | 06/2/2022 | 83902 |
| WORK | 06/3/2022 | 74042 |
| WORK | 06/3/2022 | 88357 |
| WORK | 06/3/2022 | 88348 |
| WORK | 06/3/2022 | 83902 |
+-------+------------+---------+

Create table:
CREATE TABLE work_leave (
TYPE varchar,
DATES date,
USER_ID integer
);
INSERT INTO work_leave
VALUES ('LEAVE', '05/16/2022', 74042),
('LEAVE', '05/16/2022', 88357),
('LEAVE', '05/16/2022', 88348),
('LEAVE', '05/16/2022', 83902),
('LEAVE', '05/26/2022', 74042),
('LEAVE', '05/26/2022', 88357),
('LEAVE', '05/26/2022', 88348),
('LEAVE', '05/26/2022', 83902),
('WORK', '06/2/2022', 74042),
('WORK', '06/2/2022', 88357),
('WORK', '06/2/2022', 88348),
('WORK', '06/2/2022', 83902),
('WORK', '06/3/2022', 74042),
('WORK', '06/3/2022', 88357),
('WORK', '06/3/2022', 88348),
('WORK', '06/3/2022', 83902);
WITH date_ends AS (
SELECT
type,
ARRAY[min(dates),
max(dates)] AS dates
FROM
work_leave
GROUP BY
type
),
users AS (
SELECT
type,
array_agg(DISTINCT (user_id)
ORDER BY user_id) AS user_ids
FROM
work_leave
GROUP BY
type
)
SELECT
de.type,
de.dates,
u.user_ids
FROM
date_ends AS de
JOIN
users as u
ON de.type = u.type;
type | dates | user_ids
-------+-------------------------+---------------------------
LEAVE | {05/16/2022,05/26/2022} | {74042,83902,88348,88357}
WORK | {06/02/2022,06/03/2022} | {74042,83902,88348,88357}

I adjusted the data slightly for simplicity. Here's one idea:
WITH rows (type, dates, user_id) AS (
VALUES ('WORK', array['2022-06-02', '2022-06-03'], array[74042,88357,83902,88348])
, ('LEAVE', array['2022-05-16', '2022-05-26'], array[83902,74042,88357,88348])
)
SELECT r1.type, x.*
FROM rows AS r1
CROSS JOIN LATERAL (
SELECT r2.dates, r3.user_id
FROM unnest(r1.dates) AS r2(dates)
, unnest(r1.user_id) AS r3(user_id)
) AS x
;
The fiddle
The result:
type
dates
user_id
WORK
2022-06-02
74042
WORK
2022-06-02
88357
WORK
2022-06-02
83902
WORK
2022-06-02
88348
WORK
2022-06-03
74042
WORK
2022-06-03
88357
WORK
2022-06-03
83902
WORK
2022-06-03
88348
LEAVE
2022-05-16
83902
LEAVE
2022-05-16
74042
LEAVE
2022-05-16
88357
LEAVE
2022-05-16
88348
LEAVE
2022-05-26
83902
LEAVE
2022-05-26
74042
LEAVE
2022-05-26
88357
LEAVE
2022-05-26
88348

how to join tables on cases where none of function(a) in b

Say in MonetDB (specifically, the embedded version from the "MonetDBLite" R package) I have a table "events" containing entity ID codes and event start and end dates, of the format:
| id | start_date | end_date |
| 1 | 2010-01-01 | 2010-03-30 |
| 1 | 2010-04-01 | 2010-06-30 |
| 2 | 2018-04-01 | 2018-06-30 |
| ... | ... | ... |
The table is approximately 80 million rows of events, attributable to approximately 2.5 million unique entities (ID values). The dates appear to align nicely with calendar quarters, but I haven't thoroughly checked them so assume they can be arbitrary. However, I have at least sense-checked them for end_date > start_date.
I want to produce a table "nonevent_qtrs" listing calendar quarters where an ID has no event recorded, e.g.:
| id | last_doq |
| 1 | 2010-09-30 |
| 1 | 2010-12-31 |
| ... | ... |
| 1 | 2018-06-30 |
| 2 | 2010-03-30 |
| ... | ... |
(doq = day of quarter)
If the extent of an event spans any days of the quarter (including the first and last dates), then I wish for it to count as having occurred in that quarter.
To help with this, I have produced a "calendar table"; a table of quarters "qtrs", covering the entire span of dates present in "events", and of the format:
| first_doq | last_doq |
| 2010-01-01 | 2010-03-30 |
| 2010-04-01 | 2010-06-30 |
| ... | ... |
And tried using a non-equi merge like so:
create table nonevents
as select
id,
last_doq
from
events
full outer join
qtrs
on
start_date > last_doq or
end_date < first_doq
group by
id,
last_doq
But this is a) terribly inefficient and b) certainly wrong, since most IDs are listed as being non-eventful for all quarters.
How can I produce the table "nonevent_qtrs" I described, which contains a list of quarters for which each ID had no events?
If it's relevant, the ultimate use-case is to calculate runs of non-events to look at time-till-event analysis and prediction. Feels like run length encoding will be required. If there's a more direct approach than what I've described above then I'm all ears. The only reason I'm focusing on non-event runs to begin with is to try to limit the size of the cross-product. I've also considered producing something like:
| id | last_doq | event |
| 1 | 2010-01-31 | 1 |
| ... | ... | ... |
| 1 | 2018-06-30 | 0 |
| ... | ... | ... |
But although more useful this may not be feasible due to the size of the data involved. A wide format:
| id | 2010-01-31 | ... | 2018-06-30 |
| 1 | 1 | ... | 0 |
| 2 | 0 | ... | 1 |
| ... | ... | ... | ... |
would also be handy, but since MonetDB is column-store I'm not sure whether this is more or less efficient.

Let me assume that you have a table of quarters, with the start date of a quarter and the end date. You really need this if you want the quarters that don't exist. After all, how far back in time or forward in time do you want to go?
Then, you can generate all id/quarter combinations and filter out the ones that exist:
select i.id, q.*
from (select distinct id from events) i cross join
quarters q left join
events e
on e.id = i.id and
e.start_date <= q.quarter_end and
e.end_date >= q.quarter_start
where e.id is null;

SQL finding overlapping dates given start and end date

Given a data set in MS SQL Server 2012 where travelers take trips (with trip_ID as UID) and where each trip has start_date and an end_date, I'm looking to find the trip_ID's for each traveler where trip's overlap and the range of that overlap. So if the initial table looks like this:
| trip_ID | traveler | start_date | end_date | trip_length |
|---------|----------|------------|------------|-------------|
| AB24 | Alpha | 2017-01-29 | 2017-01-31 | 2|
| BA02 | Alpha | 2017-01-31 | 2017-02-10 | 10|
| CB82 | Charlie | 2017-02-20 | 2017-02-23 | 3|
| CA29 | Bravo | 2017-02-26 | 2017-02-28 | 2|
| AB14 | Charlie | 2017-03-06 | 2017-03-08 | 2|
| DA45 | Bravo | 2017-03-26 | 2017-03-29 | 3|
| BA22 | Bravo | 2017-03-29 | 2017-04-03 | 5|
I'm looking for a query that will append three columns to the original table: overlap_id, overlap_start, overlap_end. The idea is that each row will have a value (or NULL) for an overlapping trip along with the start and end dates for overlap itself. Like this:
| trip_ID | traveler | start_date | end_date |trip_length|overlap_id |overlap_start| overlap_end|
|---------|----------|------------|------------|-----------|------------|-------------|------------|
| AB24 | Alpha | 2017-01-29 | 2017-01-31 | 2|BA02--------|2017-01-31---|2017-01-31--|
| BA02 | Alpha | 2017-01-31 | 2017-02-10 | 10|AB24--------|2017-01-31---|2017-01-31--|
| CB82 | Charlie | 2017-02-20 | 2017-02-23 | 3|NULL--------|NULL---------|NULL--------|
| CA29 | Bravo | 2017-02-26 | 2017-02-28 | 2|NULL--------|NULL---------|NULL--------|
| AB14 | Charlie | 2017-03-06 | 2017-03-08 | 2|NULL--------|NULL---------|NULL--------|
| DA45 | Bravo | 2017-03-26 | 2017-03-29 | 3|BA22--------|2017-03-28---|2017-03-29--|
| BA22 | Bravo | 2017-03-28 | 2017-04-03 | 5|DA45--------|2017-03-28---|2017-03-29--|
I've tried variations of Overlapping Dates in SQL to inform my approach but it's not returning the right answers. I'm only looking for overlaps for the same traveler (i.e., within Alpha or Bravo, not between Alpha and Bravo).
For the overlap_id column, I think the code would have to test if a trip's start_date plus range(0, trip_length) returns a value within the range of dates between start_date and end_date for any other trip where the traveler is the same, then the trip_id is updated to equal the id of the matching trips. If this is the right concept, I'm not sure how to make a variable out of trip_length so I test a range of values for it, i.e., run this for all values of trip_length - x until trip_length - x = 0.
--This might be the bare bones of an answer
update table
set overlap_id = CASE
WHEN ( DATEADD(day, trip_length, start_date) = SELECT (DATEADD(day, trip_length, start_date) from table where traveler = traveler)

You can join the table with itself (the join condition is described here):
SELECT t.*, o.trip_ID, o.start_date, o.end_date
FROM t
LEFT JOIN t AS o ON t.trip_ID <> o.trip_ID -- trip always overlaps itself so exclude it
AND o.traveler = t.traveler -- same traveller
AND t.start_date <= o.end_date -- overlap test
AND t.end_date >= o.start_date

Sum of time statuses

I have a simplified version of the table with information about the status of an instrument. I am trying to find the total time from each status. The DateAdded is a timestamp indicating the beginning of the status and the next entry would be the end of the first status and beginning of the next status.
+--------------------+-----------+-------------------------+
| InstrumentStatusId | Statename | DateAdded |
+--------------------+-----------+-------------------------+
| 737062 | alarming | 2018-03-14 00:37:51.423 |
| 737064 | running | 2018-03-14 00:38:12.410 |
| 737065 | running | 2018-03-14 00:38:21.443 |
| 737149 | alarming | 2018-03-14 01:45:03.433 |
| 737152 | error | 2018-03-14 01:45:39.443 |
| 737153 | idle | 2018-03-14 01:45:42.457 |
| 737154 | running | 2018-03-14 01:45:42.460 |
| 737155 | idle | 2018-03-14 01:45:45.490 |
| 737356 | running | 2018-03-14 04:20:21.350 |
| 737382 | idle | 2018-03-14 04:36:03.433 |
| 737383 | running | 2018-03-14 04:36:03.437 |
| 737384 | idle | 2018-03-14 04:36:06.463 |
| 737890 | running | 2018-03-14 10:13:00.313 |
| 738201 | alarming | 2018-03-14 11:10:41.120 |
| 738204 | idle | 2018-03-14 11:11:11.120 |
|
+--------------------+-----------+-------------------------+
I am having trouble figuring out a solution that takes into account multiple same statuses and calculating the time difference between statuses. I have seen similar questions but can't find a solution that has helped me.
I have a sqlfiddle to play with the data.

This answer interprets "status" as being synonymous with statename.
I think you just need the first record for each status. To get that, use lag(), then lead() on the result, then aggregation:
select statename,
sum(datediff(second, dateadded, next_dateadded)) as total_seconds
from (select s.*,
lead(dateadded) over (order by dateadded) as next_dateadded
from (select s.*, lag(statename) over (order by dateadded) as prev_statename
from instrumentstatus s
) s
where prev_statename is null or prev_statename <> statename
) s
group by statename;
Here is a SQL Fiddle.

Only Some Dates From SQL SELECT Being Set To "0" or "1969-12-31" -- UNIX_TIMESTAMP

So I have been doing pretty well on my project (Link to previous StackOverflow question), and have managed to learn quite a bit, but there is this one problem that has been really dogging me for days and I just can't seem to solve it.
It has to do with using the UNIX_TIMESTAMP call to convert dates in my SQL database to UNIX time-format, but for some reason only one set of dates in my table is giving me issues!
==============
So these are the values I am getting -
#abridged here, see the results from the SELECT statement below to see the rest
#of the fields outputted
| firstVst | nextVst | DOB |
| 1206936000 | 1396238400 | 0 |
| 1313726400 | 1313726400 | 278395200 |
| 1318910400 | 1413604800 | 0 |
| 1319083200 | 1413777600 | 0 |
when I use this SELECT statment -
SELECT SQL_CALC_FOUND_ROWS *,UNIX_TIMESTAMP(firstVst) AS firstVst,
UNIX_TIMESTAMP(nextVst) AS nextVst, UNIX_TIMESTAMP(DOB) AS DOB FROM people
ORDER BY "ref DESC";
So my big question is: why in the heck are 3 out of 4 of my DOBs being set to date of 0 (IE 12/31/1969 on my PC)? Why is this not happening in my other fields?
I can see the data quite well using a more simple SELECT statement and the DOB field looks fine...?
#formatting broken to change some variable names etc.
select * FROM people;
| ref | lastName | firstName | DOB | rN | lN | firstVst | disp | repName | nextVst |
| 10001 | BlankA | NameA | 1968-04-15 | 1000000 | 4600000 | 2008-03-31 | Positive | Patrick Smith | 2014-03-31 |
| 10002 | BlankB | NameB | 1978-10-28 | 1000001 | 4600001 | 2011-08-19 | Positive | Patrick Smith | 2011-08-19 |
| 10003 | BlankC | NameC | 1941-06-08 | 1000002 | 4600002 | 2011-10-18 | Positive | Patrick Smith | 2014-10-18 |
| 10004 | BlankD | NameD | 1952-08-01 | 1000003 | 4600003 | 2011-10-20 | Positive | Patrick Smith | 2014-10-20 |

It's because those DoB's are from before 12/31/1969, and the UNIX epoch starts then, so anything prior to that would be negative.
From Wikipedia:
Unix time, or POSIX time, is a system for describing instants in time, defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds.
A bit more elaboration: Basically what you're trying to do isn't possible. Depending on what it's for, there may be a different way you can do this, but using UNIX timestamps probably isn't the best idea for dates like that.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Retrieve unknown values based on non-schema specific sequence of column values - sql

Related

How to split these multiple rows in SQL?

how to join tables on cases where none of function(a) in b

SQL finding overlapping dates given start and end date

Sum of time statuses

Only Some Dates From SQL SELECT Being Set To "0" or "1969-12-31" -- UNIX_TIMESTAMP

Categories

Resources