I have a customer who uses different devices over a specific period of time, tracked with valid_from and valid_to dates. However, every time something changes for a device, a new row is written with no visible change in the row-based data, apart from a new valid_from/valid_to.
What I'm trying to do is aggregate the first two rows into one, and the same for rows 3 and 4, while leaving rows 5 and 6 as they are. All the solutions I have come up with so far only work as long as the user never switches back to device A; everything keeps failing.
I'd really appreciate some help, thanks in advance!
If you know that the previous valid_to is the same as the current valid_from, then you can use lag() to identify where a new grouping starts, a cumulative sum to assign the grouping, and finally aggregation:
select cust, act_dev, min(valid_from), max(valid_to)
from (select t.*,
             sum(case when prev_valid_to = valid_from then 0 else 1 end) over
                 (partition by cust order by valid_from) as grouping
      from (select t.*,
                   lag(valid_to) over (partition by cust, act_dev order by valid_from) as prev_valid_to
            from t
           ) t
     ) t
group by cust, act_dev, grouping;
Here is a db<>fiddle.
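To make the behavior concrete, here is a hedged walkthrough with made-up data (the values are assumptions for illustration, not the asker's actual rows; column names are taken from the query above):

cust | act_dev | valid_from | valid_to   | prev_valid_to | grouping
1    | A       | 2020-01-01 | 2020-02-01 | null          | 1
1    | A       | 2020-02-01 | 2020-03-01 | 2020-02-01    | 1
1    | B       | 2020-03-01 | 2020-04-01 | null          | 2
1    | B       | 2020-04-01 | 2020-05-01 | 2020-04-01    | 2
1    | A       | 2020-05-01 | 2020-06-01 | 2020-03-01    | 3
1    | C       | 2020-06-01 | 2020-07-01 | null          | 4

Rows 1-2 and rows 3-4 collapse into one group each, and the return to device A in row 5 starts a new group because its prev_valid_to (2020-03-01, the last valid_to seen for device A) does not match its valid_from.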
I saw this question on one of the interview practice platforms and initially assumed it was a gaps-and-islands problem, but I couldn't get anywhere with that logic.
The ETL tool Ab Initio has a function called key_change that can solve these kinds of problems, but I am unable to do it in SQL.
Could you please help me with this?
Wow, I haven't heard "Ab Initio" in a long time. In any case, this does appear to be a gaps-and-islands problem, one that can be solved using the difference of row numbers. However, it can also be solved with a similar method that uses lag() and lead() and no aggregation:
select customer_id, status, effective_date,
       lead(effective_date, 1, '2099-12-31') over (partition by customer_id order by effective_date) as end_date
from (select t.*,
             lag(status) over (partition by customer_id order by effective_date) as prev_status
      from t
     ) t
where prev_status is null or prev_status <> status;
Basically, this picks up every row where there is a change of status and then uses the end date from the next row (where there is a change).
Your confusion over the nature of the problem may be due to how the data is presented: the dates are not ordered, which can be confusing.
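A hedged illustration with assumed values (not from the original question): given rows of (customer_id, status, effective_date) equal to (1, A, 2020-01-01), (1, A, 2020-02-01), (1, B, 2020-03-01), (1, A, 2020-04-01), the WHERE clause keeps the first, third, and fourth rows. Because window functions in the SELECT clause are evaluated after the WHERE clause, the lead() then runs over just those change rows and produces end_date values of 2020-03-01, 2020-04-01, and the default 2099-12-31.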
I've been stuck on this issue and could really use a suggestion or some help.
What I have in a table is a basic user flow on a website. For every session ID, there's a page visited from start (lands on homepage) to finish (purchase). This has been ordered by timestamp to get a count of pages visited during the process. This 'page count' has also been partitioned by session ID so it resets to 1 every time the ID changes.
What I need to do now is assign a step count (the highlighted column is what I'm trying to achieve). This should assign a similar count, but it shouldn't keep counting across duplicate steps (i.e., someone visited multiple product pages: it's multiple pages, but still only one 'product view' step).
You'd think this would be done using a dense rank partitioned by session ID, but that's where I get stuck. You can't order on page count, because that assigns a unique number to each step. You can't order by step, because that orders it alphabetically.
What could I do to achieve this?
Screenshot of desired outcome:
Many thanks!
Use lag() to see whether two adjacent values are the same, then a cumulative sum:
select t.*,
       sum(case when prev_cs = custom_step then 0 else 1 end) over
           (partition by session_id order by timestamp) as steps_count
from (select t.*,
             lag(custom_step) over (partition by session_id order by timestamp) as prev_cs
      from t
     ) t
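A quick hedged walkthrough with invented values: for a single session whose custom_step values in timestamp order are Home, Product, Product, Cart, Purchase, prev_cs is null, Home, Product, Product, Cart; the case expression yields 1, 1, 0, 1, 1; and the running sum gives steps_count values of 1, 2, 2, 3, 4, so the repeated Product pages share one step.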
Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(flag),
       COUNTIF(IFNULL(flag, TRUE)) OVER(PARTITION BY session_id ORDER BY timestamp) AS steps_count
FROM (
  SELECT *,
         custom_step != LAG(custom_step) OVER(PARTITION BY session_id ORDER BY timestamp) AS flag
  FROM `project.dataset.table`
)
-- ORDER BY timestamp
How can I do a SUMIF-style calculation that adds up values while the value in another column is "False", but restarts the running total whenever it hits a "True", including the value of that first "True" row in the new sum? I would also like it to add up the values in chronological order.
I did some research and I think I need to use an OVER (PARTITION BY ...) and a row-number column so I can treat each restart as row number 1, but I'm not sure how to do this.
Edit: the sum should also include the "distance" value for the first "True" value it encounters.
Edit 2: Ultimately, I am trying to calculate the average distance each vehicle travels before an alert is triggered to "True", which means it needs to be taken to the shop to be fixed. Perhaps there is a better way to do this than what I was originally thinking?
Sorry for the poor phrasing...
You want to define groups. It sounds like you want the definition to be the number of "trues" up to and including a given row. Then, you can do a cumulative sum within each group. So:
select t.*,
       sum(distance) over (partition by vehicleid, grp
                           order by date
                           rows between unbounded preceding and current row
                          )
from (select t.*,
             sum(case when alert = 'True' then 1 else 0 end) over
                 (partition by vehicleid
                  order by date
                  rows between unbounded preceding and current row
                 ) as grp
      from t
     ) t;
Here is a db<>fiddle that illustrates that this code works.
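If the ultimate goal is Edit 2 (the average distance per vehicle before an alert), one possible extension of the same grouping idea is to total each group and then average the totals. This is a sketch using the same assumed table and column names as above, not a tested solution:

select vehicleid, avg(group_distance) as avg_distance_per_alert
from (select vehicleid, grp, sum(distance) as group_distance
      from (select t.*,
                   sum(case when alert = 'True' then 1 else 0 end) over
                       (partition by vehicleid
                        order by date
                        rows between unbounded preceding and current row
                       ) as grp
            from t
           ) t
      group by vehicleid, grp
     ) t
group by vehicleid;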
You are right in thinking that you can use the SUM analytic function. Something like this will do the cumulative sum for you.
To restart the SUM when the alert is "True", include the alert in the partition window and order by date to get the chronological order.
SELECT SUM(CASE WHEN alert = 'FALSE'
                THEN distance
                ELSE 0
           END)
           OVER (PARTITION BY alert
                 ORDER BY date) AS cumm_sum,
       date,
       alert
FROM Table
I have been working with window functions a fair amount, but I don't think I understand enough about how they work to explain why they behave the way they do.
For the query I was working on (below), why am I required to add my aggregated field to the GROUP BY? (In the second half of my query, I am unable to produce a result if I don't include "Events" in the second GROUP BY.)
With Data as (
      Select
             CohortDate as month
             ,datediff(week, CohortDate, EventDate) as EventAge
             ,count(distinct case when EventDate is not null then GUID end) as Events
      From MyTable
      where month >= [getdate():month] - interval '12 months'
      group by 1, 2
      order by 1, 2
)
Select
       month
       ,EventAge
       ,sum(Events) over (partition by month order by SubAge asc rows between unbounded preceding and current row) as TotEvents
from data
group by 1, 2, Events
order by 1, 2
I have run into this enough that I have just taken it for granted, but would really love some more color as to why this is needed. Is there a way I should be formatting these differently in order to avoid this (somewhat non-intuitive) requirement?
Thanks a ton!
What you are looking for is presumably a cumulative sum. That would be:
select month, EventAge,
       sum(sum(Events)) over (partition by month
                              order by SubAge asc
                              rows between unbounded preceding and current row
                             ) as TotEvents
from data
group by 1, 2
order by 1, 2;
Why? That might be a little hard to explain. Perhaps if you see the equivalent version with a subquery it will be clearer:
select me.*,
       sum(sum_events) over (partition by month
                             order by SubAge asc
                             rows between unbounded preceding and current row
                            ) as TotEvents
from (select month, EventAge, sum(Events) as sum_events
      from data
      group by 1, 2
     ) me
order by 1, 2;
The first query is pretty much exact shorthand for this one. The window function is evaluated after aggregation: you want to sum the SUM of the events after the aggregation, hence sum(sum(Events)). After the aggregation, Events itself is no longer available.
The nesting of aggregation functions is awkward at first -- at least it was for me. When I first started using window functions, I think I first spent a few days writing aggregation queries using subqueries and then rewriting without the subqueries. Quickly, I got used to writing them without subqueries.
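A small invented example may help: if, after GROUP BY month, EventAge, one month's aggregated rows have sum(Events) values of 10, 5, and 7 in window order, the outer window sum runs over those three aggregated rows and yields TotEvents values of 10, 15, and 22.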
My goal is to build an hourly count for records that have a start date/time and an end date/time. The actual records never span more than 24 hours from start to finish, and many are shorter. It works if I bounce every record against my "clock", which has 24 slots for every date up to today, but it can take forever to run, since there can be 2,000 records in a day.
This is the detail I get:
The date/times in green are what I want as the start date/time for a group. The blue date/times are what I want as the end date/time for the group.
Like this:
I have tried partitioning, but because the 4th row in the second picture has the same values as the 2nd row, they get grouped together even though there is a time span between them (the third row).
This is a gaps-and-islands problem. The start and end dates match on adjacent rows, so a difference of row numbers seems sufficient:
select id, min(startdatetime), max(enddatetime),
       d_id, class, location
from (select t.*,
             row_number() over (partition by id order by startdatetime) as seqnum,
             row_number() over (partition by id, d_id, class, location order by startdatetime) as seqnum_2
      from t
     ) t
group by id, d_id, class, location, (seqnum - seqnum_2)
order by id, min(startdatetime);
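A hedged walkthrough with invented rows for one id, ordered by startdatetime, where the combination column stands for (d_id, class, location):

row | combination | seqnum | seqnum_2 | seqnum - seqnum_2
1   | X, C1, L1   | 1      | 1        | 0
2   | X, C1, L1   | 2      | 2        | 0
3   | Y, C2, L1   | 3      | 1        | 2
4   | X, C1, L1   | 4      | 3        | 1

Rows 1 and 2 share a difference and collapse into one group, while row 4, despite matching rows 1 and 2 on d_id, class, and location, gets a different difference because row 3 interrupted the streak. That is exactly the behavior the simple partitioning missed.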