Running Total Based on State Change and Date Difference

I want to calculate a running total based on state changes and the date difference between rows.
I have machines that write a row of data every few milliseconds, giving the machine's current state (Off-line, Stopped, Ready, Active or Error), some machine data and a timestamp.
These machines can run for a few minutes or a few days, so a fixed date range doesn't work for measuring how long the current status has lasted.
An example of the data:
RowID, MachineID, Status, TimeStamp
1, Machine1, Active, 27/04/2022 10:00:00.050
I pick up the current status by selecting the TOP 1 entry for the MachineID, ordered by RowID descending.
If the current status is Active, I want to know how long the machine has been in that state. In other words, I want a date diff between the latest entry and the first entry at which the status changed to Active.
All advice is welcome, and thanks for reading my post.
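One possible approach, sketched here with Python's built-in sqlite3 module so it can be run as-is: find the newest row (the current status), find the newest row whose status differs from it, and take the earliest timestamp after that row. The table name machine_log, the sample rows and the ISO-8601 text timestamps are assumptions made for the illustration; the query would need adapting to the real database and its date functions.

```python
import sqlite3
from datetime import datetime

# Hypothetical table and sample rows; ISO-8601 text timestamps are an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE machine_log ("
    "RowID INTEGER PRIMARY KEY, MachineID TEXT, Status TEXT, TimeStamp TEXT)"
)
conn.executemany(
    "INSERT INTO machine_log VALUES (?, ?, ?, ?)",
    [
        (1, "Machine1", "Ready", "2022-04-27 09:59:00.000"),
        (2, "Machine1", "Active", "2022-04-27 10:00:00.050"),
        (3, "Machine1", "Active", "2022-04-27 10:00:00.100"),
        (4, "Machine1", "Stopped", "2022-04-27 10:05:00.000"),
        (5, "Machine1", "Active", "2022-04-27 10:06:00.000"),
        (6, "Machine1", "Active", "2022-04-27 10:06:30.500"),
    ],
)

QUERY = """
WITH latest AS (          -- newest row for the machine = current status
    SELECT Status, TimeStamp
    FROM machine_log
    WHERE MachineID = :m
    ORDER BY RowID DESC
    LIMIT 1
),
last_change AS (          -- newest row whose status differs from the current one
    SELECT COALESCE(MAX(l.RowID), 0) AS change_row
    FROM machine_log l
    JOIN latest ON l.Status <> latest.Status
    WHERE l.MachineID = :m
)
SELECT latest.Status, MIN(s.TimeStamp) AS since, latest.TimeStamp AS last_seen
FROM machine_log s
CROSS JOIN latest
CROSS JOIN last_change
WHERE s.MachineID = :m
  AND s.RowID > last_change.change_row
GROUP BY latest.Status, latest.TimeStamp
"""


def current_status_duration(conn, machine_id):
    """Return (status, since, seconds_in_status), or None if the machine has no rows."""
    row = conn.execute(QUERY, {"m": machine_id}).fetchone()
    if row is None:
        return None
    status, since, last_seen = row
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    elapsed = datetime.strptime(last_seen, fmt) - datetime.strptime(since, fmt)
    return status, since, elapsed.total_seconds()


result = current_status_duration(conn, "Machine1")
print(result)
```

The earlier Active run (rows 2 and 3) is ignored because the Stopped row 4 sits between it and the current run. This sketch assumes RowID increases with time for each machine.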

Related

Boolean conditions that span rows in Spark

I'm trying to calculate a boolean column based on a group and date range.
I have a table that records transactions with the following row structure:
Person GUID - Date - Payment Amount
There are multiple rows per person.
What I want is a new boolean column, called Recent, that is determined by whether the person had a transaction within a given time period, say 3 days prior. It would be True if they have and False if they have not.
Any idea for a query to do this?

It depends on when the start time for the beginning of "prior" is. If it's "now" (the current time), then it's quite easy: you want to find the max date per person and then filter on that being no more than some distance from the current time.
Take a look at window functions in Spark and how they can be used with time series.
To find the max date you'll use an expression such as
max(Date) over (partition by Person) as max_date
Hope this helps.
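As a rough illustration of the "now" case described above, the same window expression can be tried out in SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer). The table name transactions, the sample rows and the fixed reference date are assumptions; Spark SQL would use its own date functions in place of julianday.

```python
import sqlite3

# Hypothetical sample data; column names follow the question loosely.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (Person TEXT, Date TEXT, Amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("p1", "2024-01-01", 10.0),
        ("p1", "2024-01-09", 20.0),
        ("p2", "2024-01-02", 5.0),
    ],
)

# "now" is passed in as a parameter so the example is reproducible.
now = "2024-01-10"

rows = conn.execute(
    """
    SELECT Person, Date, Amount,
           julianday(:now)
             - julianday(MAX(Date) OVER (PARTITION BY Person)) <= 3 AS Recent
    FROM transactions
    ORDER BY Person, Date
    """,
    {"now": now},
).fetchall()

for row in rows:
    print(row)
```

Every row of a person gets the same flag here, because the flag depends only on that person's latest transaction date.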

How can I optimally modify this BigQuery query to retrieve the latest available data

I have the following query. It initially performs a sub-select by querying on a table that is partitioned (sharded) by sample_date_time. It does this by filtering using a date range in the WHERE that is passed via parameters. Then the final SELECT selects the data to be returned.
The query currently returns data for the latest complete hour (from the start of the previous hour's boundary to the end of it). I want to adapt it to instead get the latest hour of data that contains any data sample, up to a maximum of approximately 5 hours ago. The query can't use anything that invalidates the BigQuery cache within any given hour (e.g. I can't use a date function that gets the current date). The table data only updates every hour.
I'm thinking maybe I need to select the max sample_date_time in the initial sub-select, over a range of the last 5 hours. I could pass the hourly end boundary of the current time as a parameter, but I'm not seeing how I can limit the date range for which to retrieve the MAX, then use that max to get the start and end dates of the most recent hour that has any data.
WITH data AS (
  SELECT
    created_date_time,
    sample_date_time,
    station,
    channel,
    value
  FROM my.mart
  WHERE sample_date_time BETWEEN '2019-07-23 04:00:00.000000+00:00' AND '2019-07-23 04:59:59.999000+00:00'
    AND station = '[my_guid]'
)
SELECT sample_date_time, station, channel, value
FROM data
ORDER BY value desc, channel asc, sample_date_time desc

How to find where a total condition exist

I am trying to create a report that will show how long an automated sprinkler system has run for. The system is comprised of several sprinklers, with each one keeping track of only itself, and then sends that information to a database. My problem is that each sprinkler has its own run time (I.E. if 5 sprinklers all ran at the same time for 10 minutes, it would report back a total run time of 50 minutes), and I want to know only the net amount of run time - in this example, it would be 10 minutes.
The database is comprised of a time stamp and a boolean; it records the time stamp every time a sprinkler is switched on or off (the on/off state is indicated by the 1/0 of the boolean).
So, to figure out the total net time the system was on each day, whether it was 1 sprinkler running or all of them, I need to check the database for time frames where no sprinklers were turned on at all (or where ANY sprinkler at all was turned on). I would think the beginning of the query would look something like
SELECT * FROM MyTable
WHERE MyBoolean = 0
AND [ ... ]
But I'm not sure what the conditional statements that would follow the AND would be like to check the time stamps.
Is there a query I can send to the database that will report back this format of information?
EDIT:
Here's the table the data is recorded to - it's literally just a name, a boolean, and a datetime of when the boolean was changed, and that's the entire database

Every time a sprinkler turns on the number of running sprinklers increments by 1, and every time one turns off the number decrements by 1. If you transform the data so you get this:
timestamp on/off
07:00:05 1
07:03:10 1
07:05:45 -1
then you have a sequence of events in order; which sprinklers they refer to is irrelevant. (I've changed the zeros to -1 for reasons that will become evident in a moment. You can do this with "(2 * value) - 1")
Now put a running total together:
select a.timestamp,
       (SELECT SUM(b.on_off)
        FROM sprinkler_events b
        WHERE b.timestamp <= a.timestamp) as run_total
from sprinkler_events a
order by a.timestamp;
where sprinkler_events is the transformed data I listed above. This will give you:
timestamp run_total
07:00:05 1
07:03:10 2
07:05:45 1
and so on. Every row in this result with a running total of zero marks a time at which all sprinklers were turned off, which I think is what you're looking for. If you need to sum the time they were on or off, you'll need to do additional processing: search for "date difference between consecutive rows" and you'll see solutions for that.
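A minimal sketch of the whole idea, run against SQLite from Python: the 0/1 values are mapped to -1/+1 inside the query, the running total is built with a correlated subquery, and the net on-time is then summed in Python over the intervals where the total is above zero. The table name sprinklers, the column names and the sample rows are assumptions, and the sketch assumes no two events share exactly the same timestamp.

```python
import sqlite3
from datetime import datetime

# Hypothetical events: (sprinkler name, 1 = turned on / 0 = turned off, timestamp).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sprinklers (name TEXT, value INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO sprinklers VALUES (?, ?, ?)",
    [
        ("A", 1, "2024-06-01 07:00:05"),
        ("B", 1, "2024-06-01 07:03:10"),
        ("A", 0, "2024-06-01 07:05:45"),
        ("B", 0, "2024-06-01 07:10:00"),
        ("A", 1, "2024-06-01 08:00:00"),
        ("A", 0, "2024-06-01 08:02:00"),
    ],
)

# Running count of sprinklers that are on after each event.
# (2 * value - 1) turns 1/0 into +1/-1.
rows = conn.execute(
    """
    SELECT a.ts,
           (SELECT SUM(2 * b.value - 1)
            FROM sprinklers b
            WHERE b.ts <= a.ts) AS run_total
    FROM sprinklers a
    ORDER BY a.ts
    """
).fetchall()


def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


# Add up every interval during which at least one sprinkler was on.
net_seconds = 0.0
for (ts, total), (next_ts, _) in zip(rows, rows[1:]):
    if total > 0:
        net_seconds += (parse(next_ts) - parse(ts)).total_seconds()

print([total for _, total in rows], net_seconds)
```

In this sample the system runs from 07:00:05 to 07:10:00 and again from 08:00:00 to 08:02:00, so overlapping sprinklers are counted only once.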

You might consider looking for whether all the sprinklers are currently off. For example:
SELECT COUNT(DISTINCT s._NAME) AS sprinklers_currently_off
FROM (
  SELECT
    _NAME,
    _VALUE,
    _TIMESTAMP,
    ROW_NUMBER() OVER (PARTITION BY _NAME ORDER BY _TIMESTAMP DESC, _VALUE) AS latest_rec
  FROM sprinklers
) s
WHERE
  s._VALUE = 0
  AND s.latest_rec = 1
The inner query orders the records so that you can get the latest status of all the sprinklers, and the outer query counts how many are currently off. If you have 10 sprinklers you would report them all off when this query returns 10.
You could modify this by applying a date range to the inner query if you wanted to look into the past, but this should get you on the right track.

storing data ranges - effective representation

I need to store a value for every day in a timeline, i.e. every user in the database should have a status assigned for every day, like this:
from 1.1.2000 to 28.05.2011 - status 1
from 29.05.2011 to 30.01.2012 - status 3
from 1.2.2012 to infinity - status 4
Each day should have only one status assigned, and the last status has no end (until another one is given). My question is: what is an effective representation of this in an SQL database? The obvious solution is to create a row for each change (with the last day the status is assigned in each range), like this:
uptodate status
28.05.2011 status 1
30.01.2012 status 3
01.01.9999 status 4
This has many problems. If I wanted to add another range, say from 15.02.2012, I would need to alter the last row too:
uptodate status
28.05.2011 status 1
30.01.2012 status 3
14.02.2012 status 4
01.01.9999 status 8
It also requires lots of checking to make sure there are no overlaps or errors, especially if someone wants to modify ranges in the middle of the list. Inserting a new status from 29.01.2012 to 10.02.2012 is hard to implement (it would require the date ranges of status 3 and status 4 to shrink accordingly to make space for the new status). Is there any better solution?
I thought about a completely different solution, storing each day's status in a separate row, so there would be a row for every day in the timeline. This would make updates easy: simply enter the new status for rows with a date between start and end. Of course this would generate a large amount of needless data, so it's a bad solution, but it is coherent and easy to manage. I was wondering if there is something in between, but I guess not.
More context: I want a moderator to be able to assign a status freely to any dates, and edit it if needed. Most often, though, the moderator will be adding new status date ranges at the end. I don't really need the last status. After the moderator finishes editing a whole month, I need to generate a report based on the status on each day in that month. But at any time the moderator may want to edit data from months ago (which would be reflected in updated reports), and can set a status, for example, one year in advance.

You seem to want to use this table for two things: recording the current status and the history of status changes. You should separate the current status out and move it up to the parent (just like the registered date).
User
===============
Registered Date
Current Status

Status History
===============
Uptodate
Status

Your table structure should include the effective and end dates of the status period. This effectively "tiles" the statuses into groups that don't overlap. The last row should have a dummy end date (as you have above) or NULL. Using a value instead of NULL is useful if you have indexes on the end date.
With this structure, to get the status on any given date, you use the query:
select *
from t
where <date> between effdate and enddate
To add a new status at the end of the period requires two changes:
Modify the row in the table with the enddate = 01/01/9999 to have an enddate of yesterday.
Insert a new row with the effdate of today and an enddate of 01/01/9999
I would wrap this in a stored procedure.
To change a status on one date in the past requires splitting one of the historical records in two. Multiple dates may require changing multiple records.
If you have a date range, you can get all tiles that overlap a given time period with the query:
select *
from t
where <periodstart> <= enddate and <periodend> >= effdate
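The tiling scheme above can be tried out with a small sketch using Python's sqlite3 module. The helper names add_status, status_on and tiles_overlapping, the sentinel end date and the sample rows are all assumptions made for illustration; in a real system the two-step change would live in a stored procedure, as suggested.

```python
import sqlite3

OPEN_END = "9999-01-01"  # dummy end date marking the open-ended last tile

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (effdate TEXT, enddate TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2000-01-01", "2011-05-28", 1),
        ("2011-05-29", "2012-01-30", 3),
        ("2012-01-31", OPEN_END, 4),
    ],
)


def add_status(conn, start, status):
    """Close the open tile the day before `start`, then append a new open tile."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE t SET enddate = date(?, '-1 day') WHERE enddate = ?",
            (start, OPEN_END),
        )
        conn.execute("INSERT INTO t VALUES (?, ?, ?)", (start, OPEN_END, status))


def status_on(conn, day):
    """Status in effect on a single day."""
    row = conn.execute(
        "SELECT status FROM t WHERE ? BETWEEN effdate AND enddate", (day,)
    ).fetchone()
    return row[0] if row else None


def tiles_overlapping(conn, period_start, period_end):
    """Statuses of all tiles that overlap the given period, oldest first."""
    rows = conn.execute(
        "SELECT status FROM t WHERE ? <= enddate AND ? >= effdate ORDER BY effdate",
        (period_start, period_end),
    ).fetchall()
    return [r[0] for r in rows]


add_status(conn, "2012-02-15", 8)
print(status_on(conn, "2012-02-14"), status_on(conn, "2012-02-15"))
print(tiles_overlapping(conn, "2012-01-01", "2012-02-29"))
```

Dates are stored as ISO-8601 text so that plain string comparison orders them correctly.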

SQL - state machine - reporting on historical data based on changeset

I want to record user states and then be able to report historically based on the record of changes we've kept. I'm trying to do this in SQL (using PostgreSQL) and I have a proposed structure for recording user changes like the following.
CREATE TABLE users (
userid SERIAL NOT NULL PRIMARY KEY,
name VARCHAR(40),
status CHAR NOT NULL
);
CREATE TABLE status_log (
logid SERIAL,
userid INTEGER NOT NULL REFERENCES users(userid),
status CHAR NOT NULL,
logcreated TIMESTAMP
);
That's my proposed table structure, based on the data.
For the status field 'a' represents an active user and 's' represents a suspended user,
INSERT INTO status_log (userid, status, logcreated) VALUES (1, 's', '2008-01-01');
INSERT INTO status_log (userid, status, logcreated) VALUES (1, 'a', '2008-02-01');
So this user was suspended on 1st Jan and active again on 1st of February.
If I wanted to get a suspended list of customers on 15th January 2008, then userid 1 should show up. If I get a suspended list of customers on 15th February 2008, then userid 1 should not show up.
1) Is this the best way to structure this data for this kind of query?
2) How do I query the data in either this structure or in your proposed modified structure so that I can simply have a date (say 15th January) and find a list of customers that had an active status on that date in SQL only? Is this a job for SQL?

This can be done, but would be a lot more efficient if you stored the end date of each log. With your model you have to do something like:
select l1.userid
from status_log l1
where l1.status='s'
and l1.logcreated = (select max(l2.logcreated)
from status_log l2
where l2.userid = l1.userid
and l2.logcreated <= date '2008-02-15'
);
With the additional column it would be more like:
select userid
from status_log
where status='s'
and logcreated <= date '2008-02-15'
and logsuperseded >= date '2008-02-15';
(Apologies for any syntax errors, I don't know Postgresql.)
To address some further issues raised by Phil:
"A user might get moved from active, to suspended, to cancelled, to active again. This is a simplified version, in reality, there are even more states and people can be moved directly from one state to another."
This would appear in the table like this:
userid  from        to          status
FRED    2008-01-01  2008-01-31  s
FRED    2008-02-01  2008-02-07  c
FRED    2008-02-08  (null)      a
I used a null for the "to" date of the current record. I could have used a future date like 2999-12-31 but null is preferable in some ways.
"Additionally, there would be no "end date" for the current status either, so I think this slightly breaks your query?"
Yes, my query would have to be re-written as
select userid
from status_log
where status='s'
and logcreated <= date '2008-02-15'
and (logsuperseded is null or logsuperseded >= date '2008-02-15');
A downside of this design is that whenever the user's status changes you have to end-date their current status_log row as well as create a new one. However, that isn't difficult, and I think the query advantage probably outweighs this.
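To compare the two query shapes discussed above, here is a small sketch that runs both against the question's sample data in SQLite via Python. The logsuperseded values are filled in by hand, the function names are hypothetical, and the date literals are passed as ISO text parameters instead of the `date '...'` syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE status_log ("
    "logid INTEGER PRIMARY KEY, userid INTEGER NOT NULL, status TEXT NOT NULL, "
    "logcreated TEXT, logsuperseded TEXT)"
)
# The question's two rows; logsuperseded is NULL for the current status.
conn.executemany(
    "INSERT INTO status_log (userid, status, logcreated, logsuperseded) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "s", "2008-01-01", "2008-01-31"),
        (1, "a", "2008-02-01", None),
    ],
)


def suspended_on_start_only(conn, day):
    """Uses only logcreated: picks each user's latest log row on or before `day`."""
    rows = conn.execute(
        """
        SELECT l1.userid
        FROM status_log l1
        WHERE l1.status = 's'
          AND l1.logcreated = (SELECT MAX(l2.logcreated)
                               FROM status_log l2
                               WHERE l2.userid = l1.userid
                                 AND l2.logcreated <= :day)
        """,
        {"day": day},
    ).fetchall()
    return [r[0] for r in rows]


def suspended_on_with_end(conn, day):
    """Uses the extra end-date column, treating NULL as still current."""
    rows = conn.execute(
        """
        SELECT userid
        FROM status_log
        WHERE status = 's'
          AND logcreated <= :day
          AND (logsuperseded IS NULL OR logsuperseded >= :day)
        """,
        {"day": day},
    ).fetchall()
    return [r[0] for r in rows]


jan = suspended_on_start_only(conn, "2008-01-15"), suspended_on_with_end(conn, "2008-01-15")
feb = suspended_on_start_only(conn, "2008-02-15"), suspended_on_with_end(conn, "2008-02-15")
print(jan, feb)
```

Both forms agree on this data: user 1 is listed as suspended on 15 January and not on 15 February.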

Does Postgres support analytic queries? This would give the active users on 2008-02-15
select userid
from
(
  select logid,
         userid,
         status,
         logcreated,
         max(logcreated) over (partition by userid) max_logcreated_by_user
  from status_log
  where logcreated <= date '2008-02-15'
) latest
where logcreated = max_logcreated_by_user
and status = 'a';

@Tony the "end" date isn't necessarily applicable.
A user might get moved from active, to suspended, to cancelled, to active again. This is a simplified version, in reality, there are even more states and people can be moved directly from one state to another.
Additionally, there would be no "end date" for the current status either, so I think this slightly breaks your query?

@Phil
I like Tony's solution. It seems to most appropriately model the situation described. Any particular user has a status for a given period of time (a minute, an hour, a day, etc.), but it is for a duration, not an instant in time. Since you want to know who was active during a certain period of time, modeling the information as a duration seems like the best approach.
I am not sure that additional statuses are a problem. If someone is active, then suspended, then cancelled, then active again, each of those statuses would be applicable for a given duration, would they not? It may be a very short duration, such as a few seconds or a minute, but it would still be a length of time.
Are you concerned that a person's status can change multiple times in a given day, but you want to know who was active for a given day? If so, then you just need to define more specifically what it means to be active on a given day. If it is enough that they were active for any part of that day, then Tony's answer works well as is. If they would have to be active for a certain amount of time in a given day, then Tony's solution could be modified to determine the length of time (in hours, or minutes, or days), adding further restrictions in the WHERE clause to retrieve the proper date, status, and length of time in that status.
As for there being no "end date" for the current status, that is no problem either as long as the end date is nullable. Simply use something like this: "WHERE enddate >= '2008-08-15' or enddate is null".
