I have an audit table where specific actions are recorded (such as 'access', 'create', 'update', and so on). I am selecting these records so that they can be displayed in a table to the administrative user.
This works fine when I select all the records for a particular entity. However, because I am using the Post-Redirect-Get pattern, an 'access' record is logged on every page view. In a typical session an end user may hit the same page 6 or 7 times within the same 5-minute window. As a consequence, the administrative user has to scroll through quite a few redundant access records, which understandably detracts from the user experience.
To solve this problem, I have written two queries. The first selects all records that are not access records. The second selects the access records and groups them into ten-minute intervals. I then UNION these two queries and order by the datetime.
-- Select non 'access' records
SELECT
     [ORIGIN_ID]
    ,[ORIGIN_ID_TYPE]
    ,[REFERENCE_ID]
    ,[REFERENCE_ID_TYPE]
    ,[ACTION_TYPE_ID]
    ,CAST([ORIGINAL_VALUE] AS VARCHAR(8000)) AS ORIGINAL_VALUE
    ,CAST([CHANGED_VALUE] AS VARCHAR(8000)) AS CHANGED_VALUE
    ,[CREATED_BY]
    ,[CREATED_ON]
FROM [HISTORY]
WHERE [ORIGIN_ID] = 500 AND [ORIGIN_ID_TYPE] = 4 AND [ACTION_TYPE_ID] != 1
UNION
-- Select 'access' records and group them into 10 minute intervals by ts
SELECT
     [ORIGIN_ID]
    ,[ORIGIN_ID_TYPE]
    ,[REFERENCE_ID]
    ,[REFERENCE_ID_TYPE]
    ,[ACTION_TYPE_ID]
    ,CAST([ORIGINAL_VALUE] AS VARCHAR(255)) AS ORIGINAL_VALUE
    ,CAST([CHANGED_VALUE] AS VARCHAR(255)) AS CHANGED_VALUE
    ,[CREATED_BY]
    ,DATEADD(MINUTE, DATEDIFF(MINUTE, 0, [CREATED_ON]) / 10 * 10, 0) AS CREATED_ON
FROM [HISTORY]
WHERE [ACTION_TYPE_ID] = 1 AND [ORIGIN_ID] = 500 AND [ORIGIN_ID_TYPE] = 4
GROUP BY
     [ORIGIN_ID]
    ,[ORIGIN_ID_TYPE]
    ,[REFERENCE_ID]
    ,[REFERENCE_ID_TYPE]
    ,[ACTION_TYPE_ID]
    ,CAST([ORIGINAL_VALUE] AS VARCHAR(255))
    ,CAST([CHANGED_VALUE] AS VARCHAR(255))
    ,[CREATED_BY]
    ,DATEADD(MINUTE, DATEDIFF(MINUTE, 0, [CREATED_ON]) / 10 * 10, 0)
ORDER BY [CREATED_ON] DESC
SQLFiddle (SQLFiddle limited how much data I could upload).
I feel like there may be a better way to do this that does not require a UNION. To make the UNION work I had to cast my TEXT columns to VARCHAR, and I suspect there is a better alternative. Any suggestions as to how this query can be improved?
Eliminate the UNION by grouping on the two expressions below. The second expression also becomes the new, combined created_on column. The first can additionally be used to control sorting and is otherwise discarded. (Don't forget to remove the filter on action_type_id from the WHERE clause too.):
case when action_type_id <> 1 then 1 else 2 end,
case when action_type_id <> 1
then created_on
else DATEADD(MINUTE, DATEDIFF(MINUTE, 0, [CREATED_ON]) / 10 * 10, 0)
end
This will cause the query to treat the two types of actions as distinct for purposes of aggregation. Since you do want to keep every row with a non-1 action, you don't collapse those into ten-minute blocks at all.
Note that this wouldn't quite work as is if it's possible to have two such rows recorded with the same timestamp. You'd need to group on another ID (or just row_number()) to get around that but I suspect that'll be unnecessary.
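Putting both expressions into the original query gives a single, UNION-free statement. A sketch (untested; it keeps the question's column list and casts, with the two CASE expressions in the GROUP BY and the second also selected as the combined CREATED_ON):
SELECT
     [ORIGIN_ID]
    ,[ORIGIN_ID_TYPE]
    ,[REFERENCE_ID]
    ,[REFERENCE_ID_TYPE]
    ,[ACTION_TYPE_ID]
    ,CAST([ORIGINAL_VALUE] AS VARCHAR(8000)) AS ORIGINAL_VALUE
    ,CAST([CHANGED_VALUE] AS VARCHAR(8000)) AS CHANGED_VALUE
    ,[CREATED_BY]
    -- Non-access rows keep their exact timestamp; access rows snap to their 10-minute block
    ,CASE WHEN [ACTION_TYPE_ID] <> 1
          THEN [CREATED_ON]
          ELSE DATEADD(MINUTE, DATEDIFF(MINUTE, 0, [CREATED_ON]) / 10 * 10, 0)
     END AS CREATED_ON
FROM [HISTORY]
WHERE [ORIGIN_ID] = 500 AND [ORIGIN_ID_TYPE] = 4   -- no ACTION_TYPE_ID filter any more
GROUP BY
     [ORIGIN_ID]
    ,[ORIGIN_ID_TYPE]
    ,[REFERENCE_ID]
    ,[REFERENCE_ID_TYPE]
    ,[ACTION_TYPE_ID]
    ,CAST([ORIGINAL_VALUE] AS VARCHAR(8000))
    ,CAST([CHANGED_VALUE] AS VARCHAR(8000))
    ,[CREATED_BY]
    ,CASE WHEN [ACTION_TYPE_ID] <> 1 THEN 1 ELSE 2 END
    ,CASE WHEN [ACTION_TYPE_ID] <> 1
          THEN [CREATED_ON]
          ELSE DATEADD(MINUTE, DATEDIFF(MINUTE, 0, [CREATED_ON]) / 10 * 10, 0)
     END
ORDER BY CREATED_ON DESC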
Related
I am using pgAdmin 4.16.
The database contains a table called flights, in which every row represents a flight. When a flight is delayed, delay codes are used to describe the reason for the delay, and each delay code has an associated delay time recording how long that delay lasted. A flight can have up to 3 delay codes, each with its own delay time. I can group the delay codes over one pair of columns (delay code and delay time), but not over all 3 pairs. Here is the script:
SELECT delay_code_1, COUNT(delay_code_1), AVG(delay_time_1), SUM(delay_time_1)
FROM flights
GROUP BY delay_code_1
ORDER BY SUM(delay_time_1) DESC
Here is the flights table:
Here is the desired result:
My sincere thanks.
This problem occurs because the table is not in normal form: repeating groups should be broken out into another table. If this is your schema, you might consider redesigning it.
But assuming you cannot change the schema, one solution would be to union three passes over the table, e.g.
SELECT delay_code, SUM(delay_time) as Total_Time
FROM
(
    SELECT delay_code_1 as delay_code, delay_time_1 as delay_time
    FROM flights
    WHERE delay_code_1 is not null
    UNION ALL
    SELECT delay_code_2 as delay_code, delay_time_2 as delay_time
    FROM flights
    WHERE delay_code_2 is not null
    UNION ALL
    SELECT delay_code_3 as delay_code, delay_time_3 as delay_time
    FROM flights
    WHERE delay_code_3 is not null
) as delays
GROUP BY delay_code
HTH
EDITED: as mentioned by @a_horse_with_no_name, UNION ALL should be used here. Plain UNION de-dupes the results, so Total_Time would be wrong if there were multiple delays with the same code and time.
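If you also want the count and average per code, as in the question's original query and desired result, the outer SELECT extends naturally. A sketch:
SELECT delay_code,
       COUNT(*)        as Num_Delays,
       AVG(delay_time) as Avg_Time,
       SUM(delay_time) as Total_Time
FROM
(
    SELECT delay_code_1 as delay_code, delay_time_1 as delay_time
    FROM flights
    WHERE delay_code_1 is not null
    UNION ALL
    SELECT delay_code_2, delay_time_2
    FROM flights
    WHERE delay_code_2 is not null
    UNION ALL
    SELECT delay_code_3, delay_time_3
    FROM flights
    WHERE delay_code_3 is not null
) as delays
GROUP BY delay_code
ORDER BY Total_Time DESC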
OK, since it seems that my last two questions (this one and this one) only led to confusion, I will try to explain the FULL problem here, so this might be a long post.
I'm trying to create a database for a trading system. The database has 2 main tables: "Ticks" and "Candles". As shown in the figure, each table has its own attributes.
Candles, bars, and OHLC are the same thing: just a way to represent aggregated tick data in a chart, nothing more.
There are many ways to aggregate ticks into a candle. In this post, I'm asking about one particular way: creating one candle every 500 ticks. So, if the ticks table has 1000 ticks, I can create only 2 candles. If it has 500 ticks, I can create 1 candle. If it has 5000 ticks, I can create 10 candles. If there are 5001 ticks, I still have only 10 candles, because I'm missing another 499 ticks to complete the 11th candle.
Currently, I'm storing all the ticks with one Python script and creating (and therefore inserting into the candles table) the candles with another Python script. This is a real-time process.
Both scripts run in a while True: loop. No, I can't (read: shouldn't) stop the scripts, because the market is open 24 hours a day, 5 days a week.
What I'm trying to do is get rid of the Python script that creates and stores the candles in the candles table. Why? Because I think it will improve performance: instead of running multiple queries to find out how many ticks are available to create a new candle, I think a trigger could handle it more efficiently (please correct me if I'm mistaken).
I don't know how to actually solve it, but what I'm trying is this (thanks to @GordonLinoff for helping me in previous questions):
do $$
begin
with total_ticks as (
select count(*) c from (
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc) totals),
ticks_for_candles as(
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc
), candles as(
select max(date) as date,
max(bid) filter (where mod(seqnum, 500) = 1) as open,
max(bid) as high,
min(bid) as low,
max(bid) filter (where mod(seqnum, 500) = 500-1) as close,
max(ask) filter (where mod(seqnum, 500) = 500-1) as ask
from (
select t.*, row_number() over (order by date) as seqnum
from (select * from ticks_for_candles) t) as a
group by floor((seqnum - 1) /500)
having count(*) = 500
)
case 500<(select * from total_ticks)
when true then
return select * from candles
end;
end $$;
Using this, I get this error:
ERROR: syntax error at or near "case"
LINE 33: case 500<(select * from total_ticks)
^
SQL state: 42601
Character: 945
As you can see, there is no SELECT after the CTEs. If I put:
select case 500<(select * from total_ticks)
when true then
return select * from candles
end;
end $$;
I get this error:
ERROR: subquery must return only one column
LINE 31: (select * from candles)
^
QUERY: with total_ticks as (
select count(*) c from (
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc) totals),
ticks_for_candles as(
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc
), candles as(
select max(date) as date,
max(bid) filter (where mod(seqnum, 500) = 1) as open,
max(bid) as high,
min(bid) as low,
max(bid) filter (where mod(seqnum, 500) = 500-1) as close,
max(ask) filter (where mod(seqnum, 500) = 500-1) as ask
from (
select t.*, row_number() over (order by date) as seqnum
from (select * from ticks_for_candles) t) as a
group by floor((seqnum - 1) /500)
having count(*) = 500
)
select case 1000>(select * from total_ticks)
when true then
(select * from candles)
end
CONTEXT: PL/pgSQL function inline_code_block line 4 at SQL statement
SQL state: 42601
So honestly, I don't know how to do it correctly. It doesn't have to use the exact code I provide here, but the desired output looks as follows:
----------------------------------------------------------------------------------
| date                       | open    | high    | low     | close   | ask     |
| 2020-05-01 20:39:27.603452 | 1.0976  | 1.09766 | 1.09732 | 1.09762 | 1.09776 |
This would be the output when there are enough ticks to create only 1 candle. If there are enough to create two of them, there should be 2 rows.
So, at the end of the day, what I have in mind is a trigger that constantly checks whether there is enough data to create a candle and, if there is, creates it.
Is this a good idea, or should I stick to the Python script?
Can this be achieved with my approach?
What am I doing wrong?
What should I do, and how should I manage this situation?
I really hope that the question is now complete and there is no missing information.
All comments and advice are appreciated.
Thanks!
EDIT: Since this is a real-time process, in one second there could be 499 ticks in the database and in the next second there could be 503. That means 4 ticks arrived within 1 second.
Being a database guy, my approach would be to use triggers in the database.
Create a third table candle_in_the_making that contains the data from the ticks that have not yet been aggregated to a candles entry.
Create an INSERT trigger on the ticks table (doesn't matter if BEFORE or AFTER) that does the following:
For every tick inserted, add a row to candle_in_the_making.
If the row count reaches 500, compute and insert a new candles row and TRUNCATE candle_in_the_making.
This is simple if ticks are inserted from only a single thread; a sketch follows below.
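A minimal sketch of that recipe in PL/pgSQL, using the table names from the question (eurusd_tick2 for ticks, eurusd_ohlc for candles). The staging table and its columns are assumptions for illustration, not a tested implementation:
-- Assumed staging table:
-- CREATE TABLE candle_in_the_making (date timestamp, bid numeric, ask numeric);

CREATE FUNCTION make_candle() RETURNS trigger AS $$
BEGIN
    INSERT INTO candle_in_the_making (date, bid, ask)
    VALUES (NEW.date, NEW.bid, NEW.ask);

    IF (SELECT count(*) FROM candle_in_the_making) >= 500 THEN
        INSERT INTO eurusd_ohlc (date, open, high, low, close, ask)
        SELECT max(date),
               (SELECT bid FROM candle_in_the_making ORDER BY date ASC  LIMIT 1),  -- open
               max(bid),                                                           -- high
               min(bid),                                                           -- low
               (SELECT bid FROM candle_in_the_making ORDER BY date DESC LIMIT 1),  -- close
               (SELECT ask FROM candle_in_the_making ORDER BY date DESC LIMIT 1)   -- ask at close
        FROM candle_in_the_making;

        TRUNCATE candle_in_the_making;  -- start collecting the next candle
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ticks_make_candle
AFTER INSERT ON eurusd_tick2
FOR EACH ROW EXECUTE FUNCTION make_candle();  -- EXECUTE PROCEDURE on PostgreSQL < 11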
If ticks are inserted concurrently, you have to find a way to prevent two threads from inserting the 500th tick into candle_in_the_making at the same time (otherwise you could end up with 501 entries). I can think of two ways to do that in the database:
Have an extra table c_i_m_count that contains only a single number, which is the number of rows in candle_in_the_making. Before you insert into candle_in_the_making, you run the atomic
UPDATE c_i_m_count SET counter = counter + 1 RETURNING counter;
This locks the row, so that any two INSERTs into candle_in_the_making are effectively serialized.
Use advisory locks to serialize the inserting threads. In particular, a transaction-level exclusive lock as taken by pg_advisory_xact_lock would be appropriate.
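With the advisory-lock variant, the trigger function takes the lock before touching the staging table; the key (1 here) is just an arbitrary constant that all writers agree on:
-- First statement inside the trigger function body:
PERFORM pg_advisory_xact_lock(1);
-- The lock is released automatically when the inserting transaction ends,
-- so only one tick insert at a time can count and truncate the staging table.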
We're currently using SQL Server 2005 Express and want to set up an automated FTP file transfer to our client every two hours. We want to be able to send them bi-hourly uploads without duplicates throughout the day. Is this possible by modifying this existing query?
Use Sweet
select distinct d.AccountCode, f.ProcessedFileName, f.CallStartDateTime, f.PathToFile
from CSR_CallDetail d, CSR_FileListing f
where d.CallId = f.CallId
and f.ProcessedFileName like '%mp3'
and f.CallStartDateTime between convert(varchar(10),getdate()-1,101) and convert(varchar(10),getdate(),101)
and d.AccountCode > '740000'
and f.AccountCode > '740000'
and not exists (select 1 from( select processedfilename from csr_filelisting) p
where f.compressedfilename = p.processedfilename)
Here's the updated query
Use Sweet
select distinct d.AccountCode, f.ProcessedFileName, f.CallStartDateTime, f.PathToFile
from CSR_CallDetail d, CSR_FileListing f
where d.CallId = f.CallId
and f.ProcessedFileName like '%mp3'
and DATEDiff(hh, f.callstartdatetime, GETDATE ()) <=2
and d.AccountCode > '740000'
and f.AccountCode > '740000'
and not exists (select 1 from( select processedfilename from csr_filelisting) p where f.compressedfilename = p.processedfilename)
Let's say the query you posted returns the desired result. If so, we need a date (and time) when the records were saved. All you need is to add the condition:
AND DATEDIFF(hh, date_of_record, GETDATE()) <= 2
I assume in your case it will be:
AND DATEDIFF(hh, f.CallStartDateTime, GETDATE()) <= 2
You won't be able to rely on timestamps to get guaranteed, exactly sequential, non-overlapping sets of anything; you'll always be up against a race condition. What you should do instead is add a bit column that marks a row as already processed, and set it at the time of processing. Use transactions and isolation levels to ensure that no one is updating a row while you're working on it (a brief moment, one hopes). A sketch of this approach follows.
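Here, the IsProcessed column is an invented name for illustration:
-- One-time schema change (hypothetical column):
ALTER TABLE CSR_FileListing ADD IsProcessed BIT NOT NULL DEFAULT 0;

-- Every two hours:
BEGIN TRANSACTION;

-- Lock the unprocessed rows so a concurrent run cannot grab them too.
SELECT f.CallId, f.ProcessedFileName, f.CallStartDateTime, f.PathToFile
INTO #batch
FROM CSR_FileListing f WITH (UPDLOCK, HOLDLOCK)
WHERE f.IsProcessed = 0
  AND f.ProcessedFileName LIKE '%mp3';

-- Mark exactly the rows we took.
UPDATE f
SET IsProcessed = 1
FROM CSR_FileListing f
JOIN #batch b ON b.CallId = f.CallId;

COMMIT;

-- #batch now holds this run's rows for the FTP job, with no duplicates across runs.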
I have searched for similar issues but found nothing...
I have a problem joining 2 tables in SQL. The first table is created using the following code:
DECLARE @Sequence TABLE (n DATETIME NOT NULL)
DECLARE @Index INT
SET @Index = 1

WHILE @Index <= 96 BEGIN
    INSERT @Sequence (n) VALUES (DATEADD(Minute, @Index * 5, DATEADD(Hour, -8, '06-25-2010 00:00:00')))
    SET @Index = @Index + 1
END
And when I run a regular query like this:
SELECT
    Sequence.n
FROM
    @Sequence AS Sequence
I get what I expect: 96 rows of DATETIME values spaced at 5-minute intervals, ending at 06-25-2010 00:00:00 (this value will later be substituted by a parameter in an SSRS report; that's why it might look weird to specify the 'end' of the range and use a double DATEADD).
Now the second table contains values registered by PLMC controllers, which also happen to register one record per 5 minutes (per PointID, but to keep things simple let's assume there is one PointID).
The problem is that sometimes a controller goes offline, and for a given point no value is registered, not even a 0. That's why, to get a full picture of the last 8 hours for any given point, I came up with this sequence table. So, if I take the values from the other table in the same time range using this query:
SELECT
DataTime, DataValue
FROM
PointValue
WHERE
PointValue.DataTime > DATEADD(Hour, -8, '06-22-2010 00:00:00')
AND PointValue.DataTime < '06-22-2010 00:00:00'
AND PointID = '5284'
I will get only 56 rows. This is because after 20:30 on that day the controller went offline and there are no records registered.
So here is the problem. I try to run this query to get one value for every 5 minutes, hoping to still see all 96 rows (8 hours in 5-minute intervals) with NULL values after 20:30:
SELECT
    Sequence.n,
    PointValue.DataValue
FROM
    @Sequence AS Sequence LEFT OUTER JOIN PointValue
    ON Sequence.n = PointValue.DataTime
WHERE
    PointID = '5280'
Unfortunately the result is still 56 rows, with the same timestamps as the last query... I have tried every possible join and cannot get the NULL values for the last 3.5 hours of the day. I'm sure I'm making a very simple mistake, but I have tried different ways to solve it for hours now, and I seriously need help.
SOLVED:
Thanks for your comments. I started modifying the query after reading the first comment to appear, by Blorgbeard. All I had to do was take the Cartesian product of my time sequence with all the relevant PointIDs (since I don't need all of them), so my FROM clause now looks as follows:
(SELECT Sequence.n, dbo.LABELS.theIndex FROM @TimeSequence AS Sequence, dbo.LABELS
 WHERE LEFT(dbo.LABELS.theLabel, 2) = 'VA') AS BaseTable
LEFT OUTER JOIN PointValue ON BaseTable.n = PointValue.DataTime AND BaseTable.theIndex = PointValue.PointID
Again, thank you for your help!
You need to move the PointID comparison into the ON clause; with it in the WHERE clause, you're forcing the LEFT JOIN to become an INNER JOIN:
ON Sequence.n = PointValue.DataTime AND
PointValue.PointID = '5280'
Conditions in the WHERE clause have to be met by every row in the result set.
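Applied to the query from the question, a sketch:
SELECT
    Sequence.n,
    PointValue.DataValue
FROM
    @Sequence AS Sequence
    LEFT OUTER JOIN PointValue
        ON Sequence.n = PointValue.DataTime
       AND PointValue.PointID = '5280'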
I think the problem is:
WHERE
PointID = '5280'
Because PointID is in the PointValue table, so it will be null for missing rows, and null does not equal '5280'.
I think you could change it like this to make it work:
WHERE
(PointID is null) or (PointID = '5280')
This is because, in your NULL rows, PointID is not '5280'.
Try adding that to your JOIN clause...
ON Sequence.n = PointValue.DataTime AND PointID = '5280'
I have a MySQL table with approximately 3000 rows per user. One of the columns is a datetime field, which is mutable, so the rows aren't in chronological order.
I'd like to visualize the time distribution in a chart, so I need a number of individual datapoints. 20 datapoints would be enough.
I could do this:
select timefield from entries where uid = ? order by timefield;
and look at every 150th row.
Or I could do 20 separate queries and use limit 1 and offset.
But there must be a more efficient solution...
Michal Sznajder almost had it, but you can't use column aliases in a WHERE clause in SQL. So you have to wrap it as a derived table. I tried this and it returns 20 rows:
SELECT * FROM (
    SELECT @rownum := @rownum + 1 AS rownum, e.*
    FROM (SELECT @rownum := 0) r, entries e) AS e2
WHERE uid = ? AND rownum % 150 = 0;
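One caveat: the row numbers above are assigned across all of entries, in no guaranteed order, before the uid filter applies, so the every-150th sampling is not per user. Filtering and ordering inside the derived table numbers only that user's rows; a sketch (user variables are the old-MySQL idiom; on MySQL 8+ you would use ROW_NUMBER() OVER (ORDER BY timefield) instead):
SELECT * FROM (
    SELECT @rownum := @rownum + 1 AS rownum, e.*
    FROM (SELECT @rownum := 0) r,
         (SELECT * FROM entries WHERE uid = ? ORDER BY timefield) e
) AS e2
WHERE rownum % 150 = 0;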
Something like this came to my mind:
select @rownum:=@rownum+1 rownum, entries.*
from (select @rownum:=0) r, entries
where uid = ? and rownum % 150 = 0
I don't have MySQL at hand, but maybe this will help...
As far as visualization goes, I know this is not the periodic sampling you are talking about, but I would look at all the rows for a user, choose an interval bucket, COUNT within the buckets, and show the result on a bar graph or similar. This would show a real "distribution", since many occurrences within a time frame may be significant.
SELECT DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket -- choose an appropriate granularity (days used here)
,COUNT(*)
FROM entries
WHERE uid = ?
GROUP BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
ORDER BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
Or, if you don't like the way you have to repeat yourself, or if you are playing with different buckets and want to analyze across many users in 3-D (measure in Z against uid on x and bucket on y):
SELECT uid
,bucket
,COUNT(*) AS measure
FROM (
SELECT uid
,DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket
FROM entries
) AS buckets
GROUP BY uid
,bucket
ORDER BY uid
,bucket
If I wanted to plot in 3-D, I would probably determine a way to order users according to some meaningful overall metric for the user.
@Michal
For whatever reason, your example only works when the WHERE clause compares @rownum with a less-than operator. I think that when the WHERE filters out a row, @rownum doesn't get incremented, so it can never match anything else.
If the original table has an auto incremented id column, and rows were inserted in chronological order, then this should work:
select timefield from entries
where uid = ? and id % 150 = 0 order by timefield;
Of course that doesn't work if there is no correlation between the id and the timefield, unless you don't actually care about getting evenly spaced timefields, just 20 random ones.
Do you really care about the individual data points? Or would applying the statistical aggregate functions to the day number suffice to tell you what you wish to know? (A sketch follows the list.)
AVG
STDDEV_POP
VARIANCE
TO_DAYS
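For example, a sketch of what that could look like in MySQL:
SELECT AVG(TO_DAYS(timefield))        AS avg_day,
       STDDEV_POP(TO_DAYS(timefield)) AS stddev_days,
       VARIANCE(TO_DAYS(timefield))   AS var_days
FROM entries
WHERE uid = ?;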
select timefield
from entries
where rand() <= .01 -- will return roughly 1% of rows; adjust as needed
Not a MySQL expert, so I'm not sure how rand() operates in this environment.
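If you want exactly 20 rows rather than a percentage, MySQL can also sort randomly, at the cost of a full sort of that user's rows. Like the percentage filter, this gives random points, not evenly spaced ones:
SELECT timefield
FROM entries
WHERE uid = ?
ORDER BY RAND()
LIMIT 20;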
For my reference, and for those using Postgres: Postgres 9.4 will have ordered-set aggregates that should solve this problem:
SELECT percentile_disc(0.95)
WITHIN GROUP (ORDER BY response_time)
FROM pageviews;
Source: http://www.craigkerstiens.com/2014/02/02/Examining-PostgreSQL-9.4/