Oracle GROUP BY similar timestamps? - sql

I have an activity table with a structure like this:
id prd_id act_dt grp
------------------------------------
1 1 2000-01-01 00:00:00
2 1 2000-01-01 00:00:01
3 1 2000-01-01 00:00:02
4 2 2000-01-01 00:00:00
5 2 2000-01-01 00:00:01
6 2 2000-01-01 01:00:00
7 2 2000-01-01 01:00:01
8 3 2000-01-01 00:00:00
9 3 2000-01-01 00:00:01
10 3 2000-01-01 02:00:00
I want to split the data within this activity table by product (prd_id) and activity date (act_dt), and update the group (grp) column with a value from a sequence for each of these groups.
The kicker is, I need to group by similar timestamps, where similar means "all records have a difference of exactly 1 second." In other words, within a group, the difference between any 2 records when sorted by date will be exactly 1 second, and the difference between the first and last records can be any amount of time, so long as all the intermediary records are 1 second apart.
For the example data, the groups would be:
id prd_id act_dt grp
------------------------------------
1 1 2000-01-01 00:00:00 1
2 1 2000-01-01 00:00:01 1
3 1 2000-01-01 00:00:02 1
4 2 2000-01-01 00:00:00 2
5 2 2000-01-01 00:00:01 2
6 2 2000-01-01 01:00:00 3
7 2 2000-01-01 01:00:01 3
8 3 2000-01-01 00:00:00 4
9 3 2000-01-01 00:00:01 4
10 3 2000-01-01 02:00:00 5
What method would I use to accomplish this?
The size of the table is ~20 million rows, if that affects the method used to solve the problem.

I'm not an Oracle wiz, so I'm guessing at the best option for one line:
ROUND((act_dt - DATE '2010-01-01') * 24 * 60 * 60) AS time_id,
This just needs to be "the number of seconds from [aDateConstant] to act_dt". The result can be negative. It just needs to be the number of seconds, to turn your act_dt into an INT. It has to increase with act_dt, so that time_id - sequence_id stays constant across a run of records that are 1 second apart. The line above assumes act_dt is a DATE (subtracting two DATEs in Oracle gives a number of days); the ROUND is only there to guard against fractional-day rounding. The rest should work fine.
WITH
sequenced_data
AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY prd_id ORDER BY act_dt) AS sequence_id,
ROUND((act_dt - DATE '2010-01-01') * 24 * 60 * 60) AS time_id,
yourTable.*
FROM
yourTable
)
SELECT
DENSE_RANK() OVER (PARTITION BY prd_id ORDER BY time_id - sequence_id) AS group_id,
sequenced_data.*
FROM
sequenced_data
Example data:
sequence_id | time_id | t-s | group_id
-------------+---------+-----+----------
1 | 1 | 0 | 1
2 | 2 | 0 | 1
3 | 3 | 0 | 1
4 | 8 | 4 | 2
5 | 9 | 4 | 2
6 | 12 | 6 | 3
7 | 14 | 7 | 4
8 | 15 | 7 | 4
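NOTE: This assumes no two records for the same prd_id share the same timestamp. If some do, they would need to be filtered out first, probably just using a GROUP BY in a preceding CTE. Also note that group_id restarts at 1 within each prd_id, so it is the pair (prd_id, group_id) that identifies a group.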
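To make the time_id - sequence_id trick concrete, here is a minimal Python sketch run against the sample rows from the question. The names assign_groups and EPOCH are made up for illustration, and unlike the DENSE_RANK query it numbers the groups across all products (1 to 5), the way the question's expected output does. It is only a sketch of the idea, not something to run against 20 million rows.

```python
from datetime import datetime

# Arbitrary date constant; only differences between time_ids matter.
EPOCH = datetime(2000, 1, 1)

# (id, prd_id, act_dt) rows copied from the question.
rows = [
    (1, 1, "2000-01-01 00:00:00"),
    (2, 1, "2000-01-01 00:00:01"),
    (3, 1, "2000-01-01 00:00:02"),
    (4, 2, "2000-01-01 00:00:00"),
    (5, 2, "2000-01-01 00:00:01"),
    (6, 2, "2000-01-01 01:00:00"),
    (7, 2, "2000-01-01 01:00:01"),
    (8, 3, "2000-01-01 00:00:00"),
    (9, 3, "2000-01-01 00:00:01"),
    (10, 3, "2000-01-01 02:00:00"),
]


def assign_groups(rows):
    """Return {id: grp}, where rows 1 second apart share a grp."""
    # ISO timestamps sort correctly as strings.
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    seq_in_product = {}  # prd_id -> last sequence_id handed out
    group_of_key = {}    # (prd_id, time_id - sequence_id) -> grp
    groups = {}
    for row_id, prd_id, act_dt in ordered:
        ts = datetime.strptime(act_dt, "%Y-%m-%d %H:%M:%S")
        time_id = int((ts - EPOCH).total_seconds())
        sequence_id = seq_in_product.get(prd_id, 0) + 1
        seq_in_product[prd_id] = sequence_id
        # Within a run of consecutive seconds this difference is constant.
        key = (prd_id, time_id - sequence_id)
        if key not in group_of_key:
            group_of_key[key] = len(group_of_key) + 1
        groups[row_id] = group_of_key[key]
    return groups


groups = assign_groups(rows)
```

Like the SQL version, this assumes no duplicate timestamps within a product.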

Related

PostgreSQL combining a count, a last 7 days count and a last 30 days count Grouped by days

I have a small table (TABLEA) which has two columns IDA & DATETIMEA.
TABLEA
IDA DATETIMEA
1 2020-03-16 13:15:00
2 2020-03-17 15:25:00
3 2020-03-18 17:10:00
5 2020-03-19 11:44:00
5 2020-03-20 12:55:00
5 2020-03-21 19:35:00
7 2020-03-22 10:13:00
8 2020-03-22 15:25:00
8 2020-03-28 12:12:00
9 2020-03-29 17:55:00
10 2020-03-30 11:54:00
12 2020-03-30 15:35:00
12 2020-03-31 13:19:00
I am trying to get the total number of IDAs per day, the total in the last 7 days, and the total in the last 30 days.
Expected Output
DATE DAY L7 L30
2020-03-16 1 1 1
2020-03-17 1 2 2
2020-03-18 1 3 3
2020-03-19 1 4 4
2020-03-20 1 5 5
2020-03-21 1 6 6
2020-03-22 2 8 8
2020-03-28 1 3 9
2020-03-29 1 4 10
2020-03-30 2 6 12
2020-03-31 1 7 13
I have tried putting the date related outputs in sub queries but they return 0.
SELECT t.DATETIMEA::date date,
COUNT(t.*) "day",
(SELECT COUNT(w.*) FROM TABLEA w WHERE w.DATETIMEA::date BETWEEN w.DATETIMEA::date AND w.DATETIMEA::date - 7) week,
(SELECT COUNT(m.*) FROM TABLEA m WHERE m.DATETIMEA::date BETWEEN m.DATETIMEA::date AND m.DATETIMEA::date - 30) "month"
FROM TABLEA t
GROUP BY t.DATETIMEA::date
ORDER BY t.DATETIMEA::date
If your data may have gaps in days, then you need a range frame specification rather than a rows frame. Happily Postgres supports this specification, so you can do:
select
datetimea::date date,
count(*) "day",
sum(count(*)) over(
order by datetimea::date
range between '7 day' preceding and current row
) l7,
sum(count(*)) over(
order by datetimea::date
range between '30 day' preceding and current row
) l30
from mytable
group by datetimea::date
order by datetimea::date
Demo on DB Fiddle:
date | day | l7 | l30
:--------- | --: | -: | --:
2020-03-16 | 1 | 1 | 1
2020-03-17 | 1 | 2 | 2
2020-03-18 | 1 | 3 | 3
2020-03-19 | 1 | 4 | 4
2020-03-20 | 1 | 5 | 5
2020-03-21 | 1 | 6 | 6
2020-03-22 | 2 | 8 | 8
2020-03-28 | 1 | 4 | 9
2020-03-29 | 1 | 4 | 10
2020-03-30 | 2 | 4 | 12
2020-03-31 | 1 | 5 | 13
I think window functions do what you want:
select t.datetimea::date, count(*) as on_day,
sum(count(*)) over (order by t.datetimea::date rows between 6 preceding and current row) as sum_7,
sum(count(*)) over (order by t.datetimea::date rows between 29 preceding and current row) as sum_30
from tablea t
group by t.datetimea::date;
By "last 7 days", I assume you mean today and the preceding 6 days. If you mean 7 days before today, the window frame can easily be adjusted.
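The difference between the range frame and the rows frame only shows up when days are missing, so here is a small Python sketch that computes both on the per-day counts from the question. The function names are invented for illustration. trailing_range_sum looks back a number of calendar days (like range between '7 day' preceding and current row), while trailing_rows_sum looks back a number of result rows (like rows between 6 preceding and current row).

```python
from datetime import date, timedelta

# Per-day counts derived from TABLEA in the question.
daily_counts = [
    (date(2020, 3, 16), 1),
    (date(2020, 3, 17), 1),
    (date(2020, 3, 18), 1),
    (date(2020, 3, 19), 1),
    (date(2020, 3, 20), 1),
    (date(2020, 3, 21), 1),
    (date(2020, 3, 22), 2),
    (date(2020, 3, 28), 1),
    (date(2020, 3, 29), 1),
    (date(2020, 3, 30), 2),
    (date(2020, 3, 31), 1),
]


def trailing_range_sum(daily, days):
    """For each day d, sum the counts of days in [d - days, d]."""
    sums = []
    for d, _ in daily:
        start = d - timedelta(days=days)
        sums.append(sum(c for other, c in daily if start <= other <= d))
    return sums


def trailing_rows_sum(daily, preceding):
    """For each row, sum its count and the counts of the previous rows."""
    counts = [c for _, c in daily]
    return [
        sum(counts[max(0, i - preceding): i + 1])
        for i in range(len(counts))
    ]


l7_range = trailing_range_sum(daily_counts, 7)
l30_range = trailing_range_sum(daily_counts, 30)
l7_rows = trailing_rows_sum(daily_counts, 6)
```

On 2020-03-28 the range version gives 4 (03-21, 03-22 and 03-28), while the rows version gives 8, because the six preceding rows reach back to 03-17 across the gap. Note also that 7 days preceding plus the current row spans 8 calendar days.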

Postgres generate series of columns

I have a couple of tables
Table 1:
meter_id | date
1 2019-01-01
Table 2:
meter_id | read_date | period | read
1 2019-01-01 1 5
1 2019-01-01 2 6
1 2019-01-01 5 2
1 2019-01-01 6 1
1 2019-01-01 7 2
2 2019-01-01 1 3
2 2019-01-01 2 10
1 2019-01-02 6 7
Is it possible to generate a series of columns so I end up with something like this:
meter_id | read_date  | p_1 | p_2 | p_3 | p_4 | p_5 | p_6 ...
1        | 2019-01-01 | 5   | 6   |     |     | 2   | 1
2        | 2019-01-01 | 3   | 10  |     |     |     |
1        | 2019-01-02 |     |     |     |     |     | 7
where there are 48 reads per day (every half hour)
Without having to do multiple select statements?
You can use conditional aggregation:
select t1.meter_id, t1.date,
max(t2.read) filter (where period = 1) as p_1,
max(t2.read) filter (where period = 2) as p_2,
max(t2.read) filter (where period = 3) as p_3,
. . .
from table1 t1 join
table2 t2
on t1.meter_id = t2.meter_id and t1.date = t2.read_date
group by t1.meter_id, t1.date;
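If it helps to see what conditional aggregation is doing, here is a rough Python equivalent using the Table 2 rows from the question. pivot_reads is a made-up name, and the sketch skips the join to Table 1 and simply pivots every (meter_id, read_date) pair it finds. Each output row has one slot per period, and a slot stays None when there is no read for it, which is what max(...) filter (where period = n) returns when nothing matches.

```python
# (meter_id, read_date, period, read) rows copied from Table 2.
reads = [
    (1, "2019-01-01", 1, 5),
    (1, "2019-01-01", 2, 6),
    (1, "2019-01-01", 5, 2),
    (1, "2019-01-01", 6, 1),
    (1, "2019-01-01", 7, 2),
    (2, "2019-01-01", 1, 3),
    (2, "2019-01-01", 2, 10),
    (1, "2019-01-02", 6, 7),
]


def pivot_reads(reads, periods=48):
    """Return {(meter_id, read_date): [p_1, ..., p_N]} with None for gaps."""
    table = {}
    for meter_id, read_date, period, value in reads:
        row = table.setdefault((meter_id, read_date), [None] * periods)
        current = row[period - 1]
        # Same role as max(): keep the largest read seen for this slot.
        row[period - 1] = value if current is None else max(current, value)
    return table


pivoted = pivot_reads(reads)
```

In SQL the 48 p_n columns still have to be written out (or generated by a script), since the column list of a query is fixed when it is parsed.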

How to get an hourly average number of unique persons using Hive?

I have this data in a table my_table:
camera_id person_id datetime
1 1 2017-03-02 18:06:20
1 1 2017-03-02 18:05:10
1 1 2017-04-01 18:04:09
2 1 2017-03-02 19:06:50
2 2 2017-03-02 19:07:22
2 2 2017-03-02 19:09:15
2 3 2017-05-03 19:07:05
2 4 2017-05-03 19:19:08
2 5 2017-05-03 19:20:18
I need to count an hourly average number of UNIQUE persons detected by each camera.
For example let's take camera 2 and a time window from 19:00 to 20:00. The camera determined 2 unique visits on 2017-03-02 and 3 unique visits on 2017-05-03. So, the answer is (2+3)/2 = 2.5
Expected result:
camera_id HOUR HOURLY_AVG_COUNT
1 18 1
2 19 2.5
select camera_id
,hour(datetime) as hour
,count(distinct person_id,date(datetime),hour(datetime)) /
count(distinct date(datetime),hour(datetime)) as hourly_avg_count
from my_table
group by camera_id
,hour(datetime)
order by camera_id
;
+-----------+------+------------------+
| camera_id | hour | hourly_avg_count |
+-----------+------+------------------+
| 1 | 18 | 1 |
| 2 | 19 | 2.5 |
+-----------+------+------------------+
P.s.
date(datetime),hour(datetime) can be also replaced by one of the following:
substr(cast(datetime as string),1,13)
date_format(datetime,'yyyy-MM-dd HH')
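The two count(distinct ...) expressions can be read as "distinct (person, day) pairs divided by distinct days", per camera and hour of day. Here is a minimal Python sketch of that reading on the sample data; hourly_avg_unique is a hypothetical name, not anything from Hive.

```python
from collections import defaultdict
from datetime import datetime

# (camera_id, person_id, datetime) rows copied from my_table.
detections = [
    (1, 1, "2017-03-02 18:06:20"),
    (1, 1, "2017-03-02 18:05:10"),
    (1, 1, "2017-04-01 18:04:09"),
    (2, 1, "2017-03-02 19:06:50"),
    (2, 2, "2017-03-02 19:07:22"),
    (2, 2, "2017-03-02 19:09:15"),
    (2, 3, "2017-05-03 19:07:05"),
    (2, 4, "2017-05-03 19:19:08"),
    (2, 5, "2017-05-03 19:20:18"),
]


def hourly_avg_unique(rows):
    """Return {(camera_id, hour): average unique persons per day}."""
    person_days = defaultdict(set)  # (camera, hour) -> {(person, day)}
    days = defaultdict(set)         # (camera, hour) -> {day}
    for camera_id, person_id, ts in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        key = (camera_id, t.hour)
        person_days[key].add((person_id, t.date()))
        days[key].add(t.date())
    return {key: len(person_days[key]) / len(days[key]) for key in person_days}


averages = hourly_avg_unique(detections)
```

As in the query, days on which a camera saw nobody in a given hour are not counted in the denominator, because they have no rows at all.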

Looking for duplicate transactions within a 5 minutes over a 24 hour time period

I am looking for duplicate transactions within a 5-minute window during a 24-hour period. I am trying to find users abusing other users' access. Here is what I have so far, but it only searches the past 5 minutes and not the whole 24-hour period. It is Oracle.
SELECT p.id, Count(*) count
FROM tranledg tl,
patron p
WHERE p.id = tl.patronid
AND tl.trandate > (sysdate-5/1440)
AND tl.plandesignation in ('1')
AND p.id in (select id from tranledg tl where tl.trandate > (sysdate-1))
GROUP BY p.id
HAVING COUNT(*)> 1
Example data:
Patron
id | Name
--------------------------
1 | Joe
2 | Henry
3 | Tom
4 | Mary
5 | Sue
6 | Marie
Tranledg
tranid | trandate | location | patronid
--------------------------
1 | 2015-03-01 12:01:00 | 1500 | 1
2 | 2015-03-01 12:01:15 | 1500 | 2
3 | 2015-03-01 12:03:30 | 1500 | 1
4 | 2015-03-01 12:04:00 | 1500 | 3
5 | 2015-03-01 15:01:00 | 1500 | 4
6 | 2015-03-01 15:01:15 | 1500 | 4
7 | 2015-03-01 17:01:15 | 1500 | 2
8 | 2015-03-01 18:01:30 | 1500 | 1
9 | 2015-03-01 19:02:00 | 1500 | 3
10 | 2015-03-01 20:01:00 | 1500 | 4
11 | 2015-03-01 21:01:00 | 1500 | 5
I would expect the following data to return:
ID | COUNT
1 | 2
4 | 2
You can use an analytic clause with a range window like this:
select *
from (select tranid
, patronid
, count(*) over(partition by patronid
order by trandate
range between 0 preceding
and 5/60/24 following) count
from tranledg
where trandate >= sysdate-1)
where count > 1
It will output every transaction that is followed by at least one more transaction for the same patronid within 5 minutes, along with the number of transactions in that range (you did not specify what to do when there is more than one such range or when the ranges overlap).
Output on the test data (without the condition for sysdate as it already passed):
TRANID PATRONID COUNT
------ -------- -----
1 1 2
5 4 2
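For anyone who wants to check the window logic outside the database, this Python sketch mimics the range frame on the test data: for each transaction it counts the same patron's transactions from its own time up to 5 minutes later, and keeps the ones where that count is above 1. following_counts is an invented name, and the sysdate filter is left out here as well.

```python
from datetime import datetime, timedelta

# (tranid, trandate, patronid) rows copied from Tranledg.
tranledg = [
    (1, "2015-03-01 12:01:00", 1),
    (2, "2015-03-01 12:01:15", 2),
    (3, "2015-03-01 12:03:30", 1),
    (4, "2015-03-01 12:04:00", 3),
    (5, "2015-03-01 15:01:00", 4),
    (6, "2015-03-01 15:01:15", 4),
    (7, "2015-03-01 17:01:15", 2),
    (8, "2015-03-01 18:01:30", 1),
    (9, "2015-03-01 19:02:00", 3),
    (10, "2015-03-01 20:01:00", 4),
    (11, "2015-03-01 21:01:00", 5),
]


def following_counts(rows, window=timedelta(minutes=5)):
    """Return (tranid, patronid, count) for rows with count > 1."""
    parsed = [
        (tranid, patronid, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
        for tranid, ts, patronid in rows
    ]
    result = []
    for tranid, patronid, ts in parsed:
        # Same patron, from this transaction up to 5 minutes after it.
        count = sum(
            1
            for _, other_patron, other_ts in parsed
            if other_patron == patronid and ts <= other_ts <= ts + window
        )
        if count > 1:
            result.append((tranid, patronid, count))
    return result


suspicious = following_counts(tranledg)
```

This is quadratic in the number of rows, so it is only meant to illustrate the frame, not to replace the analytic query.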
I did it using Postgres online; the Oracle version is very similar, only be careful with the date operations.
You need a self join.
SELECT T1.patronid, count(*)
FROM Tranledg T1
JOIN Tranledg T2
ON T2."trandate" BETWEEN T1."trandate" + '-2 minute' AND T1."trandate" + '2 minute'
AND T1."patronid" = T2."patronid"
AND T1."tranid" <> T2."tranid"
GROUP BY T1.patronid;
OUTPUT
You need to fix the data, so 1 has two records.

How to insert additional values in between a GROUP BY

I am currently making a monthly report using MySQL. I have a table named "monthly" that looks something like this:
id | date | amount
10 | 2009-12-01 22:10:08 | 7
9 | 2009-11-01 22:10:08 | 78
8 | 2009-10-01 23:10:08 | 5
7 | 2009-07-01 21:10:08 | 54
6 | 2009-03-01 04:10:08 | 3
5 | 2009-02-01 09:10:08 | 456
4 | 2009-02-01 14:10:08 | 4
3 | 2009-01-01 20:10:08 | 20
2 | 2009-01-01 13:10:15 | 10
1 | 2008-12-01 10:10:10 | 5
Then, when I make a monthly report (grouped by month within each year), I get something like this.
yearmonth | total
2008-12 | 5
2009-01 | 30
2009-02 | 460
2009-03 | 3
2009-07 | 54
2009-10 | 5
2009-11 | 78
2009-12 | 7
I used this query to achieve the result:
SELECT substring( date, 1, 7 ) AS yearmonth, sum( amount ) AS total
FROM monthly
GROUP BY substring( date, 1, 7 )
But I need something like this:
yearmonth | total
2008-01 | 0
2008-02 | 0
2008-03 | 0
2008-04 | 0
2008-05 | 0
2008-06 | 0
2008-07 | 0
2008-08 | 0
2008-09 | 0
2008-10 | 0
2008-11 | 0
2008-12 | 5
2009-01 | 30
2009-02 | 460
2009-03 | 3
2009-05 | 0
2009-06 | 0
2009-07 | 54
2009-08 | 0
2009-09 | 0
2009-10 | 5
2009-11 | 78
2009-12 | 7
Something that would display zeroes for the months that don't have any value. Is it even possible to do that in a MySQL query?
You should generate a dummy rowsource and LEFT JOIN with it:
SELECT *
FROM (
SELECT 1 AS month
UNION ALL
SELECT 2
…
UNION ALL
SELECT 12
) months
CROSS JOIN
(
SELECT 2008 AS year
UNION ALL
SELECT 2009 AS year
) years
LEFT JOIN
mydata m
ON m.date >= CONCAT_WS('.', year, month, 1)
AND m.date < CONCAT_WS('.', year, month, 1) + INTERVAL 1 MONTH
GROUP BY
year, month
You can create these as tables on disk rather than generate them each time.
MySQL is the only one of the major four systems that does not have an easy way to generate arbitrary result sets.
Oracle, SQL Server and PostgreSQL do have one (CONNECT BY, recursive CTEs and generate_series, respectively).
Quassnoi is right, and I'll add a comment about how to recognize when you need something like this:
You want '2008-01' in your result, yet nothing in the source table has a date in January, 2008. Result sets have to come from the tables you query, so the obvious conclusion is that you need an additional table - one that contains each month you want as part of your result.
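The same idea outside SQL, as a small Python sketch: build the full list of months first (the "additional table"), start every month at zero, then add the amounts that exist. monthly_totals is a made-up name and the years are passed in by hand, just as the years subquery is written out by hand in the answer.

```python
# (date, amount) rows copied from the "monthly" table.
monthly = [
    ("2009-12-01 22:10:08", 7),
    ("2009-11-01 22:10:08", 78),
    ("2009-10-01 23:10:08", 5),
    ("2009-07-01 21:10:08", 54),
    ("2009-03-01 04:10:08", 3),
    ("2009-02-01 09:10:08", 456),
    ("2009-02-01 14:10:08", 4),
    ("2009-01-01 20:10:08", 20),
    ("2009-01-01 13:10:15", 10),
    ("2008-12-01 10:10:10", 5),
]


def monthly_totals(rows, years):
    """Return {'YYYY-MM': total} with a 0 entry for every empty month."""
    # The generated calendar plays the role of the months x years rowsource.
    totals = {
        f"{year:04d}-{month:02d}": 0
        for year in years
        for month in range(1, 13)
    }
    for ts, amount in rows:
        yearmonth = ts[:7]  # same as substring(date, 1, 7)
        if yearmonth in totals:
            totals[yearmonth] += amount
    return totals


report = monthly_totals(monthly, years=[2008, 2009])
```

Rows outside the requested years are ignored, which matches what a LEFT JOIN from the calendar side does.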