How to get the minimum value for a given time period - SQL

I have a table with equipment failure and resolved dates. Until the failure is resolved, entries for each day show as failed. Once the issue is resolved, data starts again from the next failure date. Below is an example.
I want an output that gives me the first failure time for each resolved timestamp, like this:
I tried doing a left join between the resolved timestamps and the failure dates and taking the min, but that doesn't work.

Consider the below approach:
select type,
  max(timestamp) as resolved_timestamp,
  min(timestamp) as first_failure_timestamp
from (
  select *, countif(status = 'resolved') over win as grp
  from your_table
  window win as (partition by type order by timestamp rows between unbounded preceding and 1 preceding)
)
group by type, grp
If applied to the sample data in your question, the output is:
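For illustration, here is the same approach run against a small hypothetical table (the schema and values below are my own assumption, not the sample data from the question):

with your_table as (
  select 'pump' as type, timestamp '2023-01-01' as timestamp, 'failed' as status union all
  select 'pump', timestamp '2023-01-02', 'failed' union all
  select 'pump', timestamp '2023-01-03', 'resolved' union all
  select 'pump', timestamp '2023-01-05', 'failed' union all
  select 'pump', timestamp '2023-01-06', 'resolved'
)
select type,
  max(timestamp) as resolved_timestamp,
  min(timestamp) as first_failure_timestamp
from (
  select *, countif(status = 'resolved') over win as grp
  from your_table
  window win as (partition by type order by timestamp rows between unbounded preceding and 1 preceding)
)
group by type, grp
-- grp stays the same for every failure day up to and including its 'resolved' row,
-- so per group min(timestamp) is the first failure and max(timestamp) the resolved timestamp.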

Get first record based on time in PostgreSQL

Do we have a way to get the first record considering the time?
For example: get the first record for today, the first record for yesterday, the first record for the day before yesterday, and so on.
Note: I want to get all of these records, considering the time.
The sample expected output should be:
first_record_today,
first_record_yesterday, ...
As I understand the question, the "first" record per day is the earliest one.
For that, we can use RANK and PARTITION BY the day only, truncating the time.
In the ORDER BY clause, we sort by the time:
SELECT sub.yourdate FROM (
  SELECT yourdate,
    RANK() OVER (PARTITION BY DATE_TRUNC('DAY', yourdate)
                 ORDER BY DATE_TRUNC('SECOND', yourdate)) AS rk
  FROM yourtable
) AS sub
WHERE sub.rk = 1
ORDER BY sub.yourdate DESC;
In the main query, we will sort the data beginning with the latest date, meaning today's one, if available.
We can try it out here: db<>fiddle
If this understanding of the question is incorrect, please let us know what to change by editing your question.
A note: according to your description, a window function is not strictly necessary. A shorter GROUP BY, as shown in the other answer, can produce the correct result too and might be absolutely fine. I like the window function approach because it makes it easy to add or change conditions that might not be usable in a simple GROUP BY, which is why I chose this way.
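For comparison, a minimal sketch of that shorter GROUP BY variant (assuming the same yourtable/yourdate names; note that it only returns the date itself, not any other columns of the row):

SELECT MIN(yourdate) AS yourdate
FROM yourtable
GROUP BY DATE_TRUNC('DAY', yourdate)
ORDER BY MIN(yourdate) DESC;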
EDIT, because the question's author provided further information:
Here is the query that also fetches the first message:
SELECT sub.yourdate, sub.message FROM (
  SELECT yourdate, message,
    RANK() OVER (PARTITION BY DATE_TRUNC('DAY', yourdate)
                 ORDER BY DATE_TRUNC('SECOND', yourdate)) AS rk
  FROM yourtable
) AS sub
WHERE sub.rk = 1
ORDER BY sub.yourdate DESC;
Or if only the message without the date should be selected:
SELECT sub.message FROM (
  SELECT yourdate, message,
    RANK() OVER (PARTITION BY DATE_TRUNC('DAY', yourdate)
                 ORDER BY DATE_TRUNC('SECOND', yourdate)) AS rk
  FROM yourtable
) AS sub
WHERE sub.rk = 1
ORDER BY sub.yourdate DESC;
Updated fiddle here: db<>fiddle

How to track whether field has changed after a date

In my problem, I want to be able to track whether a state has shifted from '04a. Lapsing - Lowering Engagement' to '03d. Engaged - Very High' after trigger_send_date has occurred.
I believe a window function is required that checks whether the state is '04a. Lapsing - Lowering Engagement' before trigger_send_date and then measures whether it changes after trigger_send_date, but I can't figure out how to write it. I made a start below, but am having difficulty continuing!
Ideally I'd like a new column that is True/False as to whether that switch has occurred post trigger_send_date, within 31 days of the date occurring.
SELECT
cust_id,
state_date,
trigger_send_date,
total_state,
IF (
total_state IN ("04a. Lapsing - Lowering Engagement"),
True,
False
) as lapse,
-- Trying to write this column
SUM(IF(trigger_send_date >= state_date AND total_state IN ("04a. Lapsing - Lowering Engagement"), 1, NULL)) OVER (
PARTITION BY cust_id,
state_date
ORDER BY
state_date
) as lapsed_and_returned_within_31_days
FROM
base
ORDER BY
state_date,
trigger_send_date
Does anyone have any tips to help me write this?
This is what my table looks like with expected result as right-most column if it helps!
Let me preface my answer by saying that I don't have access to Spark SQL, so the below is written in MySQL (it would probably work in SQL Server as well). I've had a look at the docs and the window frame should still work; you might obviously need to make some tweaks.
The window frame tells the window function which rows to look at: by including UNBOUNDED PRECEDING you're telling the function to include every row before the current row, and by using UNBOUNDED FOLLOWING you're telling it to look at every row after the current row.
I tried to include another test, for a customer that was engaged before the trigger date and it seems to work. Obviously if you provided some sample data we could test further.
DROP TABLE IF EXISTS Base;
CREATE TABLE Base
(
cust_id BIGINT,
state_date DATE,
trigger_send_date DATE,
total_state VARCHAR(256)
);
INSERT INTO Base (cust_id,state_date, trigger_send_date, total_state) VALUES
(9177819375032029782,'2022-03-07','2022-03-14','03d. Engaged - Very High'),
(9177819375032029782,'2022-03-13','2022-03-14','04a. Lapsing - Lowering Engagement'),
(9177819375032029782,'2022-03-19','2022-03-14','03d. Engaged - Very High'),
(9177819375032029782,'2022-05-07','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-03-07','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-03-10','2022-03-14','04a. Lapsing - Lowering Engagement'),
(819375032029782,'2022-03-11','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-03-19','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-05-07','2022-03-14','03d. Engaged - Very High');
WITH LapsedCTE AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY cust_id ORDER BY state_date DESC) AS `RNum`
FROM Base
WHERE state_date <= trigger_send_date
AND LEFT(total_state, 3) IN ('03d','04a')
)
SELECT b.cust_id, b.state_date, b.trigger_send_date, b.total_state,
IF (
b.total_state IN ("04a. Lapsing - Lowering Engagement"),
True,
False
) as lapse,
-- Here we find the MIN engaged date (you can add other states if needed) AFTER the trigger date.
-- Then we compare that to the trigger_send_date from the list of customers that were lapsed prior to the
-- trigger_send_date (this will be empty for non-lapsed customers, so it will default to 0 in our results column).
-- Then we do a DATEDIFF between the trigger date and the engaged date; if the value is less than or equal to 31 days, Robert is your Mother's Brother.
IF(DATEDIFF(
MIN(IF(b.state_date > b.trigger_send_date AND LEFT(b.total_state, 3) IN ('03d'), b.state_date, NULL))
OVER (PARTITION BY b.cust_id ORDER BY b.state_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), l.trigger_send_date) <= 31, 1, 0) AS `lapsed_and_returned_within_31_days`
-- Here's some other stuff just to show you the inner working of the above
/*
DATEDIFF(
MIN(IF(b.state_date > b.trigger_send_date AND LEFT(b.total_state, 3) IN ('03d'), b.state_date, NULL))
OVER (PARTITION BY b.cust_id ORDER BY b.state_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), b.trigger_send_date) AS `engaged_time_lag_days`,
MIN(IF(b.state_date > b.trigger_send_date AND LEFT(b.total_state, 3) IN ('03d'), b.state_date, NULL))
OVER (PARTITION BY b.cust_id ORDER BY b.state_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS `first_engaged_date_after_trigger`
*/
FROM Base b
LEFT JOIN LapsedCTE l ON l.cust_id = b.cust_id AND l.RNum = 1 AND LEFT(l.total_state, 3) IN ('04a');
It would be possible to remove the CTE if you need to; it just makes things a bit cleaner.
Here's a runnable DBFiddle just in case you don't have access to a MySQL database.
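If the query has to run in Spark SQL after all, the window expression should translate almost unchanged. A rough, untested sketch of the result column, using substring() instead of LEFT() and otherwise the same table aliases as above:

IF(DATEDIFF(
     MIN(IF(b.state_date > b.trigger_send_date AND substring(b.total_state, 1, 3) = '03d', b.state_date, NULL))
       OVER (PARTITION BY b.cust_id ORDER BY b.state_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
     l.trigger_send_date) <= 31, 1, 0) AS lapsed_and_returned_within_31_days

Spark's DATEDIFF(end, start) takes the same argument order as MySQL's and also returns whole days, so the 31-day comparison carries over.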

How fill NULLs with previous value in SQL

I have the following table
There are some NULL values in the price column, which I want to replace with the value from the previous date (the date is manual_date). Additionally, the price column is calculated on different dates (calculation_table), so the nulls should be filled within each of these groups.
The final output should show values similar to output_price.
I found some code here that does the same thing; however, I could not figure out how to apply it to my data (one of the errors says I don't have ts in (PARTITION BY symbol ORDER BY ts). This is true, but the website's example doesn't define ts either, and I also tried replacing ts with manual_date).
I tried the following code for my data:
select manual_date,TS_FIRST_VALUE(price, 'const') output_price
from MYDATA
TIMESERIES manual_date AS '1 month'
OVER(PARTITION BY calculation_date ORDER BY ts) --tried also ORDER BY manual_date
Vertica supports IGNORE NULLS on LAST_VALUE(), so you can use:
last_value(price ignore nulls) over (
order by manual_date
rows between unbounded preceding and current row
) as output_price
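If the fill also needs to restart within each calculation group mentioned in the question, a PARTITION BY can be added. A minimal sketch, assuming the grouping column is calculation_date (as in the attempted query) and the table is MYDATA:

SELECT manual_date,
       calculation_date,
       price,
       LAST_VALUE(price IGNORE NULLS) OVER (
           PARTITION BY calculation_date
           ORDER BY manual_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS output_price
FROM MYDATA;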

Stuck on what seems like a simple SQL dense_rank task

Been stuck on this issue and could really use a suggestion or help.
What I have in a table is basic user flow on a website. For every Session ID, there's a page visited from start (lands on homepage) to finish (purchase). This has been ordered by timestamp to get a count of pages visited during this process. This 'page count' has also been partitioned by Session ID to go back to 1 every time the ID changes.
What I need to do now is assign a step count (highlighted is what I'm trying to achieve). This should assign a similar count but shouldn't keep counting at duplicate steps (i.e., someone visited multiple product pages: it's multiple pages but still only one 'product view' step).
You'd think this would be done using a dense rank, partitioned by session id - but that's where I get stuck. You can't order on page count because that'll assign a unique number to each step count. You can't order by Step because that orders it alphabetically.
What could I do to achieve this?
Screenshot of desired outcome:
Many thanks!
Use LAG to see whether two consecutive values are the same, then take a cumulative sum:
select t.*,
sum(case when prev_cs = custom_step then 0 else 1 end) over (partition by session_id order by timestamp) as steps_count
from (select t.*,
lag(custom_step) over (partition by session_id order by timestamp) as prev_cs
from t
) t
Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(flag),
COUNTIF(IFNULL(flag, TRUE)) OVER(PARTITION BY session_id ORDER BY timestamp) AS steps_count
FROM (
SELECT *,
custom_step != LAG(custom_step) OVER(PARTITION BY session_id ORDER BY timestamp) AS flag
FROM `project.dataset.table`
)
-- ORDER BY timestamp
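To see how the counting behaves, here is the same query run against a small hypothetical session (the column names match the query above; the page values are made up):

WITH sample AS (
  SELECT 's1' AS session_id, 1 AS timestamp, 'Home' AS custom_step UNION ALL
  SELECT 's1', 2, 'Product' UNION ALL
  SELECT 's1', 3, 'Product' UNION ALL
  SELECT 's1', 4, 'Cart'
)
SELECT * EXCEPT(flag),
  COUNTIF(IFNULL(flag, TRUE)) OVER(PARTITION BY session_id ORDER BY timestamp) AS steps_count
FROM (
  SELECT *,
    custom_step != LAG(custom_step) OVER(PARTITION BY session_id ORDER BY timestamp) AS flag
  FROM sample
)
-- steps_count comes out as 1, 2, 2, 3: the repeated 'Product' row does not advance the count.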

How can I make this query run efficiently?

In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: if I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, claiming that GROUP EACH BY is not parallelizable here.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, dtimestamp / 1000 as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
This runs too, now with 57,862 different groups.
I tried different combinations to get to the same error. I was able to get the same error as you by doubling the amount of initial data. An easy "hack" to double the amount of data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? As you are already partitioning the percentile_rank by day, can you add an additional query to only analyze a fraction of the days (for example, only last month)?
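For illustration, here is one way such an additional filter might look in the original query, assuming timestamp is stored in microseconds since the epoch (the cutoff date below is purely hypothetical):

SELECT day, AVG(value)/(1024*1024) FROM (
  SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
    PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
  FROM [Datastore.PerformanceDatum]
  WHERE type = "MemoryPerf"
    AND timestamp >= PARSE_UTC_USEC('2013-04-01 00:00:00')
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;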