Apply condition between rows of the same column - sql

I have a table where the first column is time, which increases by 1 second increments.
The second column holds, in its first row, the code in effect at the start of the day; this can be any value from 1 to 5. A value of 0 (zero) indicates that the code did not change. When the code changes, the row for that time shows the number it changed to (hence "Event"); for as long as the code stays the same, the column shows 0 (zero) again.
My intent is to make a new column specifying the present code at any time. So far, I have been doing this in Excel, with the following conditions and results:
Is there a way for this Excel condition to be applied in a query to create this new column?
I've been testing CASE WHEN statements, and I tried to implement Lag or Lead functions in it. But so far none of them worked to apply the same value of the previous row when the event is 0 (zero).

If the non-zero event values only ever increase, you can use a window max():
select time, event, max(event) over(order by time) code
from mytable
Else, it is a bit more complicated. One option is to build the groups with a window sum:
select time, event, max(event) over(partition by grp order by time) code
from (
    select
        t.*,
        sum(case when event > 0 then 1 else 0 end) over(order by time) grp
    from mytable t
) t
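As a rough check of the window-sum approach, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module. It assumes a SQLite build with window functions (3.25 or later), and the sample rows are made up for illustration; only the table and column names come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (time INTEGER, event INTEGER)")
# Hypothetical sample: the day starts with code 3, changes to 1, then to 5.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, 3), (2, 0), (3, 0), (4, 1), (5, 0), (6, 5), (7, 0)],
)

query = """
SELECT time, event,
       MAX(event) OVER (PARTITION BY grp ORDER BY time) AS code
FROM (
    SELECT t.*,
           -- every non-zero event starts a new group
           SUM(CASE WHEN event > 0 THEN 1 ELSE 0 END) OVER (ORDER BY time) AS grp
    FROM mytable t
) t
ORDER BY time
"""
rows = conn.execute(query).fetchall()
codes = [code for _time, _event, code in rows]
print(codes)
```

Each zero row inherits the code of the most recent non-zero row, even though the code goes down from 3 to 1 in this sample.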

Related

Count reset to 0 if reached a certain condition

The following example demonstrates the case:
Following is the sample data:
Following is the output expectation (note that there are more than 1 entities in the 'entity' column):
is_hit is true when variable_a is <= 4.
is_tagged is true once the total hits from the past days have reached 3.
What I have to do is tag whether the entity's cumulative hit count has reached 3. Once the entity is tagged, the hit count should reset to 0 again.
By this logic, in the demonstration above Entity A is tagged on 4th June and 9th June.
Currently, my issue is applying the is_tagged logic to the query. Is there a way to do this in SQL?
If I understand correctly, you want row_number():
select t.*,
       (case when is_hit and
                  mod(row_number() over (partition by entity, is_hit
                                         order by date
                                        ),
                      3) = 0
             then true
             else false
        end) as is_tagged
from t;
Note: This assumes that your columns are booleans. If they are strings then use 'true'.
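To see the "every third hit" idea in action, here is a small sketch using Python's sqlite3 module (assuming SQLite 3.25+ for window functions). The rows are invented, is_hit is stored as 0/1 because SQLite has no boolean type, and the modulo operator % stands in for mod().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (entity TEXT, date TEXT, is_hit INTEGER)")
# Hypothetical data: entity A gets its 3rd hit on June 4 and its 6th on June 9.
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("A", "2020-06-01", 1), ("A", "2020-06-02", 1), ("A", "2020-06-03", 0),
    ("A", "2020-06-04", 1), ("A", "2020-06-05", 0), ("A", "2020-06-06", 1),
    ("A", "2020-06-07", 1), ("A", "2020-06-08", 0), ("A", "2020-06-09", 1),
    ("B", "2020-06-01", 1), ("B", "2020-06-02", 1), ("B", "2020-06-03", 1),
])

query = """
SELECT entity, date,
       CASE WHEN is_hit = 1
             AND ROW_NUMBER() OVER (PARTITION BY entity, is_hit
                                    ORDER BY date) % 3 = 0
            THEN 1 ELSE 0
       END AS is_tagged
FROM t
ORDER BY entity, date
"""
tagged = [(entity, date)
          for entity, date, is_tagged in conn.execute(query)
          if is_tagged]
print(tagged)
```

Numbering the hits per entity and tagging every multiple of 3 has the same effect as resetting a counter to 0 after each tag.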

SUMIF then restart count

How can I do a SUMIF function so that it adds up values when the value in another column is "False", but then when it hits a value that is "True", it restarts the count over again, but includes the value of the first "True" encounter in the SUM calculation? I would also like it so that it adds up the value in chronological order.
I did some research and I think I need to use an over partition and make a row number column to call all row number = "1", but I'm not sure how to do this.
Edit: the Sum should also include the "distance" value for the first "true" value it encounters
Edit 2: Ultimately, I am trying to calculate the average distance each vehicle travels before an Alert is triggered to "True" which means it needs to be taken to the shop to be fixed. Perhaps there is a better way to do this than what I was originally thinking?
Sorry for the poor phrasing...
You want to define groups. It sounds like you want the definition to be the number of "trues" up to and including a given row. Then, you can do a cumulative sum within each group. So:
select t.*,
       sum(distance) over (partition by vehicleid, grp
                           order by date
                           rows between unbounded preceding and current row
                          )
from (select t.*,
             sum(case when alert = 'True' then 1 else 0 end) over
                 (partition by vehicleid
                  order by date
                  rows between unbounded preceding and current row
                 ) as grp
      from t
     ) t;
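A quick way to try the grouping query is through Python's sqlite3 module. This is only a sketch: it assumes SQLite 3.25+, the rows are made up, and the running_distance alias is added here for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (vehicleid INTEGER, date TEXT, distance INTEGER, alert TEXT)"
)
# Hypothetical trips for one vehicle; alerts fire on Jan 3 and Jan 5.
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    (1, "2020-01-01", 10, "False"),
    (1, "2020-01-02", 20, "False"),
    (1, "2020-01-03", 5, "True"),
    (1, "2020-01-04", 7, "False"),
    (1, "2020-01-05", 3, "True"),
    (1, "2020-01-06", 4, "False"),
])

query = """
SELECT t.date, t.alert, t.distance,
       SUM(distance) OVER (PARTITION BY vehicleid, grp
                           ORDER BY date
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                          ) AS running_distance
FROM (SELECT t.*,
             -- number of 'True' alerts seen so far, including this row
             SUM(CASE WHEN alert = 'True' THEN 1 ELSE 0 END) OVER
                 (PARTITION BY vehicleid
                  ORDER BY date
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                 ) AS grp
      FROM t
     ) t
ORDER BY t.vehicleid, t.date
"""
rows = conn.execute(query).fetchall()
running = [r[3] for r in rows]
print(running)
```

With this definition of grp, each 'True' row opens a new group, so the running sum restarts on the alert row and includes that row's distance.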
Here is a db<>fiddle that illustrates that this code works.
You are right in thinking that you can use SUM analytical function. Something like this will do the cumulative sum for you.
For you to restart the SUM when the alert is True, you include the alert in the partition window and Order by date to achieve the order.
SELECT SUM(CASE WHEN alert = 'FALSE'
THEN distance
ELSE 0
END)
OVER(PARTITION BY alert
ORDER BY date) cumm_sum
, date
, alert
FROM Table

SQL Server Determine the Amount of Time Above a Threshold

I am trying to determine the amount of time my data spends above a certain threshold. I have a SQL table of values that looks like this:
Where the first column is datetime and the second column is value. This is time series data, so it is a large table and cannot be changed. I want to know the first value that crosses over the threshold (say 50 for this example), which is my beginning; the last value that crosses back over the threshold, which is the end; and the duration spent over the threshold.
In my data example the Beginning would be 9/20/2019 19:18, the end would be 9/20/2019 19:46 and the duration would be 28 minutes.
This needs to be written in one sql statement due to the requirements of the project. I am just wondering if this is possible and how to do it. Thanks!
You can use lead() and some aggregation:
select t.*
from (select t.*,
             datediff(minute,
                      ts, lead(ts) over (order by ts)
                     ) as diff_minutes
      from (select t.*,
                   lead(value) over (order by ts) as next_value
            from t
           ) t
      where (value < 50 and next_value >= 50) or
            (value >= 50 and next_value < 50)
     ) t
where value < 50;
Your question is a little tricky because you want the time span to start just before the period in question. That is actually a simplification. The above implements:
Identify the next value.
Keep a row when the current value is below the threshold and the next value is not, or vice versa. These are the row just before the period and the last row of the period.
Then use lead() to get the ending timestamp.
Finally filter down to just the first row.
Another approach is perhaps simpler. Define the groups based on the count of rows that are under the threshold up to and including the current row. This keeps the row just before a period in the same group as the period that follows it.
Then aggregate:
select min(ts), max(ts),
       datediff(minute, min(ts), max(ts)) as diff_minute
from (select t.*,
             sum(case when value < 50 then 1 else 0 end) over (order by ts) as grp
      from t
     ) t
group by grp;
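The grouping query can be tried outside SQL Server with Python's sqlite3 module. This sketch makes several assumptions: SQLite 3.25+ for window functions, invented sample rows, epoch seconds via strftime('%s', ...) in place of datediff (which SQLite lacks), and a HAVING clause that is not in the answer, added here to drop groups consisting of a single below-threshold row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ts TEXT, value INTEGER)")
# Hypothetical readings; the value is above 50 from 19:19 through 19:46.
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    ("2019-09-20 19:17:00", 45),
    ("2019-09-20 19:18:00", 48),
    ("2019-09-20 19:19:00", 55),
    ("2019-09-20 19:30:00", 70),
    ("2019-09-20 19:46:00", 52),
    ("2019-09-20 19:47:00", 40),
])

query = """
SELECT MIN(ts) AS begin_ts, MAX(ts) AS end_ts,
       (CAST(strftime('%s', MAX(ts)) AS INTEGER)
        - CAST(strftime('%s', MIN(ts)) AS INTEGER)) / 60 AS diff_minute
FROM (SELECT t.*,
             -- each below-threshold row starts a new group
             SUM(CASE WHEN value < 50 THEN 1 ELSE 0 END) OVER (ORDER BY ts) AS grp
      FROM t
     ) t
GROUP BY grp
HAVING COUNT(*) > 1   -- assumption: skip groups with no above-threshold rows
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because each group starts at a below-threshold row, the reported beginning is the last reading before the value crosses the threshold, and the end is the last reading still above it.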
It looks like you are sampling every 10 seconds. If that is pretty solid, you can just count how many records are above 50 during a selected interval, and multiply by 10 seconds, that will be the duration that exceeds 50.

Selecting the first and last event per user, per day

I have a Google Analytics event which fires on my website when certain interactions are made, this may or may not fire for a user in a session, or can fire many times.
I'd like to return results showing the userID and the value of the first and last event label, per day. I have tried to do this with MAX(hits.eventInfo.eventLabel), but when I fact check my results this is not returning the last value for that user in the day as I was expecting.
SELECT Date,
customDimension.value AS UserID,
MAX(hits.eventInfo.eventLabel) AS last_value
FROM `project.dataset.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 1 day) and
DATE_sub(current_date(), interval 1 day)
AND hits.eventInfo.eventAction = "Value"
AND customDimension.index = 2
GROUP BY Date, UserID
For example, the query above returns results where user X has the following MAX() value:
20180806 User_x 69.96
But when I look at the details of that user's interactions on the day I see:
Based on this, I would expect to see 79.95 as my MAX() result, as it has the highest hit number. Instead, I seem to have selected a value from somewhere in the middle of the session. How can I adjust my query to ensure I select the last event value?
When you are looking for the maximum value of column colA while doing GROUP BY, MAX(colA) will obviously work.
But when you are looking for the value in column colA at the maximum value of column colB, you should use STRING_AGG(colA ORDER BY colB DESC LIMIT 1), or something similar using ARRAY_AGG().
So, in your case, I think it will be something like the line below (you should tune it further):
STRING_AGG(hits.eventInfo.eventLabel ORDER BY hits.hitNumber DESC LIMIT 1) AS last_value
In your case one should work with subqueries on the hits array. This allows full control over what you want to have. I used the example ga data from Google, so labels are different. But I wrote it in a way you can easily modify to fit your needs:
SELECT
date,
fullvisitorid,
visitstarttime,
(SELECT value FROM t.customDimensions WHERE index=2) userId,
(SELECT
--STRUCT(hour, minute, hitNumber, eventinfo.eventlabel) -- for testing, comment out next line
eventInfo.eventLabel
FROM t.hits
WHERE type='EVENT' AND eventInfo.eventAction <> '' -- modify to fit your condition
ORDER BY hitNumber ASC LIMIT 1
) AS firstEventLabel,
(SELECT
--STRUCT(hour, minute, hitNumber, eventinfo.eventlabel) -- for testing, comment out next line
eventInfo.eventLabel
FROM t.hits
WHERE type='EVENT' AND eventInfo.eventAction <> '' -- modify to fit your condition
ORDER BY hitNumber DESC LIMIT 1
) AS lastEventLabel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` t
LIMIT 1000 -- for testing
Basically, I'm querying the events, ordering them by hitNumber ascending or descending, and limiting to one so that there is only one result per row. The line with userId also shows how to properly get a custom dimension value.
If you are very new to this concept of working with arrays you can learn all about it here: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
MAX() should work. The one time it would return an unexpected value is if it is operating on a string, not a number.
Does this fix the problem?
MAX(CAST(hits.eventInfo.eventLabel AS FLOAT64)) AS last_value
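The answers in this thread describe three different notions of "max", and a few lines of plain Python show how they can disagree. This is only an illustration: the hit numbers and two of the labels are invented (69.96 and 79.95 come from the question), and the labels are treated as strings on the assumption that eventLabel is a string column.

```python
# (hit_number, event_label) pairs for one hypothetical user and day.
events = [(3, "69.96"), (7, "100.00"), (12, "9.99"), (15, "79.95")]
labels = [label for _hit, label in events]

# MAX() on strings compares character by character, so "9.99" wins.
string_max = max(labels)

# MAX() after a numeric cast gives the largest number.
numeric_max = max(float(label) for label in labels)

# The label at the highest hit number is what the question asks for.
last_label = max(events, key=lambda e: e[0])[1]

print(string_max, numeric_max, last_label)
```

Casting fixes the string comparison, but it still returns the largest value rather than the last one, which is why ordering by hitNumber is needed for the stated goal.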

SQL query to identify 0 AFTER a 1

Let's say I have two columns: Date and Indicator
Usually the indicator goes from 0 to 1 (when the data is sorted by date) and I want to be able to identify if it goes from 1 to 0 instead. Is there an easy way to do this with SQL?
I am already aggregating other fields in the same table. If I can add this to as another aggregation (e.g. without using a separate "where" statement or passing over the data a second time) it would be pretty awesome.
This is the phenomenon I want to catch:
Date Indicator
1/5/01 0
1/4/01 0
1/3/01 1
1/2/01 1
1/1/01 0
This isn't a teradata-specific answer, but this can be done in normal SQL.
Assuming that the sequence is already 'complete' and x(n+1) can be derived from x(n), such as when the dates are sequential and all present:
SELECT curr.date -- the 0 on the day following the 1
FROM r curr
JOIN r prev
  -- join each day with the previous day
  ON curr.date = dateadd(d, 1, prev.date)
WHERE curr.indicator = 0
  AND prev.indicator = 1
YMMV on the ability of such a query to use indexes efficiently.
If the sequence is not complete the same can be applied after making a delegate sequence which is well ordered and similarly 'complete'.
This can also be done using correlated subqueries, each selecting the indicator of the 'previous max', but.. uhg.
Joining the table against itself is quite generic, but most SQL dialects now support analytical functions. Ideally you could use LAG(), but Teradata seems to try to support the absolute minimum of these, and so they point you to use SUM() combined with ROWS ... PRECEDING.
In any regard, this method avoids a potentially costly join and effectively deals with gaps in the data, whilst making maximum use of indexes.
SELECT
*
FROM
yourTable t
QUALIFY
t.indicator
<
SUM(t.indicator) OVER (PARTITION BY t.somecolumn /* optional */
ORDER BY t.Date
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
)
QUALIFY is a bit Teradata-specific, but slightly tidier than the alternative...
SELECT
  *
FROM
(
  SELECT
    t.*,
    SUM(t.indicator) OVER (PARTITION BY t.somecolumn /* optional */
                           ORDER BY t.Date
                           ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
                          )
      AS previous_indicator
  FROM
    yourTable t
)
  lagged
WHERE
  lagged.indicator < lagged.previous_indicator
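The one-row window frame can be checked with Python's sqlite3 module, which accepts the same ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING syntax (assuming SQLite 3.25+). In this sketch the data is the sample from the question rewritten as ISO dates, the table is called r as in the earlier answer, and the optional partition column is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (date TEXT, indicator INTEGER)")
conn.executemany("INSERT INTO r VALUES (?, ?)", [
    ("2001-01-01", 0),
    ("2001-01-02", 1),
    ("2001-01-03", 1),
    ("2001-01-04", 0),
    ("2001-01-05", 0),
])

query = """
SELECT lagged.date, lagged.indicator, lagged.previous_indicator
FROM (
    SELECT t.*,
           -- a frame of exactly one row (the previous one) emulates LAG()
           SUM(t.indicator) OVER (ORDER BY t.date
                                  ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
                                 ) AS previous_indicator
    FROM r t
) lagged
WHERE lagged.indicator < lagged.previous_indicator
ORDER BY lagged.date
"""
drops = conn.execute(query).fetchall()
print(drops)
```

The first row has an empty frame, so its previous_indicator is NULL and the comparison filters it out without extra handling.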
Supposing you mean that you want to determine whether any row having 1 as its indicator value has an earlier Date than a row in its group having 0 as its indicator value, you can identify groups with that characteristic by including the appropriate extreme dates in your aggregate results:
SELECT
...
MAX(CASE indicator WHEN 0 THEN Date END) AS last_ind_0,
MIN(CASE indicator WHEN 1 THEN Date END) AS first_ind_1,
...
You then test whether first_ind_1 is less than last_ind_0, either in code or as another selection item.
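As a sketch of this aggregate-only approach, the following runs the two conditional aggregates in SQLite through Python and does the final comparison in code. The table name r, the grp column, and the second group are assumptions made for the example; group x is the sample from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (grp TEXT, date TEXT, indicator INTEGER)")
conn.executemany("INSERT INTO r VALUES (?, ?, ?)", [
    # group x goes 0 -> 1 -> 0 (the case to catch)
    ("x", "2001-01-01", 0), ("x", "2001-01-02", 1), ("x", "2001-01-03", 1),
    ("x", "2001-01-04", 0), ("x", "2001-01-05", 0),
    # group y goes 0 -> 1 only (hypothetical well-behaved group)
    ("y", "2001-01-01", 0), ("y", "2001-01-02", 0), ("y", "2001-01-03", 1),
])

query = """
SELECT grp,
       COUNT(*) AS n_rows,   -- stands in for the other aggregations
       MAX(CASE indicator WHEN 0 THEN date END) AS last_ind_0,
       MIN(CASE indicator WHEN 1 THEN date END) AS first_ind_1
FROM r
GROUP BY grp
ORDER BY grp
"""
went_back = {}
for grp, _n_rows, last_ind_0, first_ind_1 in conn.execute(query):
    # a 1 that appears before the last 0 means the indicator dropped back
    went_back[grp] = (
        last_ind_0 is not None
        and first_ind_1 is not None
        and first_ind_1 < last_ind_0
    )
print(went_back)
```

This needs only one pass over the data and sits alongside other aggregates, which matches the constraint stated in the question.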