How to get an hourly average number of unique persons using Hive?

I have this data in a table my_table:
camera_id person_id datetime
1 1 2017-03-02 18:06:20
1 1 2017-03-02 18:05:10
1 1 2017-04-01 18:04:09
2 1 2017-03-02 19:06:50
2 2 2017-03-02 19:07:22
2 2 2017-03-02 19:09:15
2 3 2017-05-03 19:07:05
2 4 2017-05-03 19:19:08
2 5 2017-05-03 19:20:18
I need to compute the hourly average number of UNIQUE persons detected by each camera.
For example, take camera 2 and the time window from 19:00 to 20:00. The camera detected 2 unique persons on 2017-03-02 and 3 unique persons on 2017-05-03, so the answer is (2+3)/2 = 2.5.
Expected result:
camera_id HOUR HOURLY_AVG_COUNT
1 18 1
2 19 2.5

select camera_id
,hour(datetime) as hour
,count(distinct person_id,date(datetime),hour(datetime)) /
count(distinct date(datetime),hour(datetime)) as hourly_avg_count
from my_table
group by camera_id
,hour(datetime)
order by camera_id
;
+-----------+------+------------------+
| camera_id | hour | hourly_avg_count |
+-----------+------+------------------+
| 1 | 18 | 1 |
| 2 | 19 | 2.5 |
+-----------+------+------------------+
P.S.
date(datetime),hour(datetime) can also be replaced by either of the following:
substr(cast(datetime as string),1,13)
date_format(datetime,'yyyy-MM-dd HH')
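To make the arithmetic behind the two count(distinct ...) expressions explicit, here is a minimal Python sketch run on the sample rows from the question. It is not Hive code; the function name hourly_avg_unique is only illustrative. For every (camera, hour) it counts the distinct (person, date) pairs and divides by the number of distinct dates.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question: (camera_id, person_id, datetime)
rows = [
    (1, 1, "2017-03-02 18:06:20"),
    (1, 1, "2017-03-02 18:05:10"),
    (1, 1, "2017-04-01 18:04:09"),
    (2, 1, "2017-03-02 19:06:50"),
    (2, 2, "2017-03-02 19:07:22"),
    (2, 2, "2017-03-02 19:09:15"),
    (2, 3, "2017-05-03 19:07:05"),
    (2, 4, "2017-05-03 19:19:08"),
    (2, 5, "2017-05-03 19:20:18"),
]


def hourly_avg_unique(rows):
    """Return {(camera_id, hour): average unique persons per observed day}."""
    persons = defaultdict(set)  # distinct (person_id, date) per (camera, hour)
    days = defaultdict(set)     # distinct dates per (camera, hour)
    for camera_id, person_id, ts in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        key = (camera_id, t.hour)
        persons[key].add((person_id, t.date()))
        days[key].add(t.date())
    return {key: len(persons[key]) / len(days[key]) for key in sorted(persons)}


result = hourly_avg_unique(rows)
for (camera_id, hour), avg in result.items():
    print(camera_id, hour, avg)
```

Note that only days on which the camera saw at least one person in that hour enter the denominator, which is also how the query behaves.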

Related

Enumerating records by date

Say we have 5 records for items sold on particular dates like this
Date of Purchase Qty
2016-11-29 19:33:50.000 5
2017-01-03 20:09:49.000 4
2017-02-23 16:21:21.000 11
2016-11-29 14:33:51.000 2
2016-12-02 16:24:29.000 4
I'd like to enumerate each record by the date in order with an extra column like this:
Date of Purchase Qty Order
2016-11-29 19:33:50.000 5 1
2017-01-03 20:09:49.000 4 3
2017-02-23 16:21:21.000 11 4
2016-11-29 14:33:51.000 2 1
2016-12-02 16:24:29.000 4 2
Notice how both dates on 2016-11-29 have the same order number because I only want to order the records by the date and not by the datetime. How would I create this extra column in just plain SQL?
Using dense_rank() and ordering by the date of DateOfPurchase
select *
, [Order] = dense_rank() over (order by convert(date,DateOfPurchase))
from t
rextester demo: http://rextester.com/FAAQL92440
returns:
+---------------------+-----+-------+
| DateOfPurchase | Qty | Order |
+---------------------+-----+-------+
| 2016-11-29 19:33:50 | 5 | 1 |
| 2016-11-29 14:33:51 | 2 | 1 |
| 2016-12-02 16:24:29 | 4 | 2 |
| 2017-01-03 20:09:49 | 4 | 3 |
| 2017-02-23 16:21:21 | 11 | 4 |
+---------------------+-----+-------+
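For readers who want to see what dense_rank() over the date part does, here is a small Python sketch on the question's sample data. The helper name dense_rank_by_date is made up for illustration; rows that share a calendar date share a rank, and ranks have no gaps.

```python
from datetime import datetime

# Sample rows from the question: (DateOfPurchase, Qty)
purchases = [
    ("2016-11-29 19:33:50", 5),
    ("2017-01-03 20:09:49", 4),
    ("2017-02-23 16:21:21", 11),
    ("2016-11-29 14:33:51", 2),
    ("2016-12-02 16:24:29", 4),
]


def dense_rank_by_date(rows):
    """Return one rank per row, in input order, ranking by calendar date only."""
    dates = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date() for ts, _ in rows]
    # Each distinct date gets the next consecutive rank, starting at 1.
    rank_of = {d: i for i, d in enumerate(sorted(set(dates)), start=1)}
    return [rank_of[d] for d in dates]


orders = dense_rank_by_date(purchases)
for (ts, qty), order in zip(purchases, orders):
    print(ts, qty, order)
```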

Looking for duplicate transactions within a 5 minutes over a 24 hour time period

I am looking for duplicate transactions that fall within a 5-minute window of each other during a 24-hour period. I am trying to find users abusing other users' access. Here is what I have so far, but it only searches the past 5 minutes and not the whole 24-hour period. The database is Oracle.
SELECT p.id, Count(*) count
FROM tranledg tl,
patron p
WHERE p.id = tl.patronid
AND tl.trandate > (sysdate-5/1440)
AND tl.plandesignation in ('1')
AND p.id in (select id from tranledg tl where tl.trandate > (sysdate-1))
GROUP BY p.id
HAVING COUNT(*)> 1
Example data:
Patron
id | Name
--------------------------
1 | Joe
2 | Henry
3 | Tom
4 | Mary
5 | Sue
6 | Marie
Tranledg
tranid | trandate | location | patronid
--------------------------
1 | 2015-03-01 12:01:00 | 1500 | 1
2 | 2015-03-01 12:01:15 | 1500 | 2
3 | 2015-03-01 12:03:30 | 1500 | 1
4 | 2015-03-01 12:04:00 | 1500 | 3
5 | 2015-03-01 15:01:00 | 1500 | 4
6 | 2015-03-01 15:01:15 | 1500 | 4
7 | 2015-03-01 17:01:15 | 1500 | 2
8 | 2015-03-01 18:01:30 | 1500 | 1
9 | 2015-03-01 19:02:00 | 1500 | 3
10 | 2015-03-01 20:01:00 | 1500 | 4
11 | 2015-03-01 21:01:00 | 1500 | 5
I would expect the following data to return:
ID | COUNT
1 | 2
4 | 2
You can use an analytic clause with a range window like this:
select *
from (select tranid
, patronid
, count(*) over(partition by patronid
order by trandate
range between 0 preceding
and 5/60/24 following) count
from tranledg
where trandate >= sysdate-1)
where count > 1
It outputs every transaction that is followed by at least one more transaction for the same patronid within 5 minutes, along with the count of transactions in that range (you did not specify what to do if there is more than one such range or if the ranges overlap).
Output on the test data (without the sysdate condition, since the sample dates are already more than a day in the past):
TRANID PATRONID COUNT
------ -------- -----
1 1 2
5 4 2
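The range window can be mimicked in plain Python to check the output above. This is only a sketch on the question's sample data, without the sysdate filter; count_following is an illustrative name. For each transaction it counts the same patron's transactions whose trandate lies between that row's trandate and 5 minutes later, inclusive.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Sample rows from the question: (tranid, trandate, patronid)
tranledg = [
    (1, "2015-03-01 12:01:00", 1),
    (2, "2015-03-01 12:01:15", 2),
    (3, "2015-03-01 12:03:30", 1),
    (4, "2015-03-01 12:04:00", 3),
    (5, "2015-03-01 15:01:00", 4),
    (6, "2015-03-01 15:01:15", 4),
    (7, "2015-03-01 17:01:15", 2),
    (8, "2015-03-01 18:01:30", 1),
    (9, "2015-03-01 19:02:00", 3),
    (10, "2015-03-01 20:01:00", 4),
    (11, "2015-03-01 21:01:00", 5),
]


def count_following(rows, window=timedelta(minutes=5)):
    """For each row, count same-patron rows in [trandate, trandate + window]."""
    parsed = [(tid, datetime.strptime(ts, FMT), pid) for tid, ts, pid in rows]
    out = []
    for tid, ts, pid in parsed:
        n = sum(
            1
            for _, other_ts, other_pid in parsed
            if other_pid == pid and ts <= other_ts <= ts + window
        )
        out.append((tid, pid, n))
    return out


# Same filter as the outer "where count > 1"
flagged = [row for row in count_following(tranledg) if row[2] > 1]
print(flagged)
```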
I did it using Postgres online; the Oracle version is very similar, only be careful with the date operations.
You need a self join.
SELECT T1.patronid, count(*)
FROM Tranledg T1
JOIN Tranledg T2
ON T2."trandate" BETWEEN T1."trandate" - interval '2 minute' AND T1."trandate" + interval '2 minute'
AND T1."patronid" = T2."patronid"
AND T1."tranid" <> T2."tranid"
GROUP BY T1.patronid;
Note that with this 2-minute window on either side, patron 1's transactions at 12:01:00 and 12:03:30 (2.5 minutes apart) do not match. Either fix the data so that patron 1 has two records inside the window, or widen the interval to 5 minutes.
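The effect of the window width in the self join can be checked with a short Python sketch on the question's sample data. The function self_join_counts is hypothetical and simply pairs every transaction with the other transactions of the same patron that lie within the window on either side. Each matching pair is counted once from each side, which is why a patron with two close transactions gets a count of 2.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Sample rows from the question: (tranid, trandate, patronid)
tranledg = [
    (1, "2015-03-01 12:01:00", 1),
    (2, "2015-03-01 12:01:15", 2),
    (3, "2015-03-01 12:03:30", 1),
    (4, "2015-03-01 12:04:00", 3),
    (5, "2015-03-01 15:01:00", 4),
    (6, "2015-03-01 15:01:15", 4),
    (7, "2015-03-01 17:01:15", 2),
    (8, "2015-03-01 18:01:30", 1),
    (9, "2015-03-01 19:02:00", 3),
    (10, "2015-03-01 20:01:00", 4),
    (11, "2015-03-01 21:01:00", 5),
]


def self_join_counts(rows, window):
    """Count joined (T1, T2) pairs per patron, like the self join does."""
    parsed = [(tid, datetime.strptime(ts, FMT), pid) for tid, ts, pid in rows]
    counts = {}
    for t1_id, t1_ts, t1_pid in parsed:
        for t2_id, t2_ts, t2_pid in parsed:
            same_patron = t1_pid == t2_pid
            different_row = t1_id != t2_id
            in_window = abs(t2_ts - t1_ts) <= window  # BETWEEN is inclusive
            if same_patron and different_row and in_window:
                counts[t1_pid] = counts.get(t1_pid, 0) + 1
    return counts


narrow = self_join_counts(tranledg, timedelta(minutes=2))
wide = self_join_counts(tranledg, timedelta(minutes=5))
print(narrow, wide)
```

With a 2-minute window only patron 4 is reported on this data; with a 5-minute window patrons 1 and 4 are reported, matching the expected result in the question.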

Weekly Average Reports: Redshift

My sales data for the two Mondays in the first two weeks of June (1st June and 8th June) is below:
date | count
2015-06-01 03:25:53 | 1
2015-06-01 03:28:51 | 1
2015-06-01 03:49:16 | 1
2015-06-01 04:54:14 | 1
2015-06-01 08:46:15 | 1
2015-06-01 13:14:09 | 1
2015-06-01 16:20:13 | 5
2015-06-01 16:22:13 | 1
2015-06-01 16:27:07 | 1
2015-06-01 16:29:57 | 1
2015-06-01 19:16:45 | 1
2015-06-08 10:54:46 | 1
2015-06-08 15:12:10 | 1
2015-06-08 20:35:40 | 1
I need to find the weekly average of the sales that happened in a given range.
Complex Query:
(some_manipulation_part), ifact as
( select date, sales_count from final_result_set
) select date_part('h', date) as h,
date_part('dow', date) as day_of_week,
count(sales_count)
from final_result_set
group by h, day_of_week;
Output :
h | day_of_week | count
3 | 1 | 3
4 | 1 | 1
8 | 1 | 1
10 | 1 | 1
13 | 1 | 1
15 | 1 | 1
16 | 1 | 8
19 | 1 | 1
20 | 1 | 1
If I apply avg to the query above, it does not return the correct answer:
(some_manipulation_part), ifact as
( select date, sales_count from final_result_set
) select date_part('h', date) as h,
date_part('dow', date) as day_of_week,
avg(sales_count)
from final_result_set
group by h, day_of_week;
h | day_of_week | count
3 | 1 | 1
4 | 1 | 1
8 | 1 | 1
10 | 1 | 1
13 | 1 | 1
15 | 1 | 1
16 | 1 | 1
19 | 1 | 1
20 | 1 | 1
I have two Mondays in the given range, but the result is not being divided by 2. I am not even sure what is happening inside Redshift.
To get "weekly averages" use date_trunc():
SELECT date_trunc('week', my_date_column) as week
, avg(sales_count) AS avg_sales
FROM final_result_set
GROUP BY 1;
I hope you are not actually using date as the name of your date column. It's a reserved word in SQL and a basic type name, so don't use it as an identifier.
If you group by the day of week (DOW) you get averages per weekday, where Sunday is 0. (Use ISODOW to get 7 for Sunday.)
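As a rough illustration of what grouping by date_trunc('week', ...) does, the following Python sketch buckets the question's sample rows by the Monday that starts each week and averages sales_count per bucket. The names week_start and weekly_avg are illustrative only, and the sketch assumes weeks start on Monday.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Sample rows from the question: (date, sales_count)
sales = [
    ("2015-06-01 03:25:53", 1), ("2015-06-01 03:28:51", 1),
    ("2015-06-01 03:49:16", 1), ("2015-06-01 04:54:14", 1),
    ("2015-06-01 08:46:15", 1), ("2015-06-01 13:14:09", 1),
    ("2015-06-01 16:20:13", 5), ("2015-06-01 16:22:13", 1),
    ("2015-06-01 16:27:07", 1), ("2015-06-01 16:29:57", 1),
    ("2015-06-01 19:16:45", 1), ("2015-06-08 10:54:46", 1),
    ("2015-06-08 15:12:10", 1), ("2015-06-08 20:35:40", 1),
]


def week_start(ts):
    """Return the Monday (as a date) of the week containing ts."""
    return ts.date() - timedelta(days=ts.weekday())


def weekly_avg(rows):
    """Average sales_count per week, keyed by the week's Monday."""
    buckets = defaultdict(list)
    for ts_text, count in rows:
        buckets[week_start(datetime.strptime(ts_text, FMT))].append(count)
    return {week: sum(v) / len(v) for week, v in sorted(buckets.items())}


weekly = weekly_avg(sales)
for week, avg in weekly.items():
    print(week, round(avg, 3))
```

The average here is per row within each week. If the goal is instead "sales per hour averaged over the number of Mondays", the hourly totals would have to be divided by the count of distinct weeks, which avg() over individual rows does not do.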

Oracle GROUP BY similar timestamps?

I have an activity table with a structure like this:
id prd_id act_dt grp
------------------------------------
1 1 2000-01-01 00:00:00
2 1 2000-01-01 00:00:01
3 1 2000-01-01 00:00:02
4 2 2000-01-01 00:00:00
5 2 2000-01-01 00:00:01
6 2 2000-01-01 01:00:00
7 2 2000-01-01 01:00:01
8 3 2000-01-01 00:00:00
9 3 2000-01-01 00:00:01
10 3 2000-01-01 02:00:00
I want to split the data within this activity table by product (prd_id) and activity date (act_dt), and update the group (grp) column with a value from a sequence for each of these groups.
The kicker is, I need to group by similar timestamps, where similar means "all records have a difference of exactly 1 second." In other words, within a group, the difference between any 2 consecutive records when sorted by date will be exactly 1 second, and the difference between the first and last records can be any amount of time, so long as all the intermediary records are 1 second apart.
For the example data, the groups would be:
id prd_id act_dt grp
------------------------------------
1 1 2000-01-01 00:00:00 1
2 1 2000-01-01 00:00:01 1
3 1 2000-01-01 00:00:02 1
4 2 2000-01-01 00:00:00 2
5 2 2000-01-01 00:00:01 2
6 2 2000-01-01 01:00:00 3
7 2 2000-01-01 01:00:01 3
8 3 2000-01-01 00:00:00 4
9 3 2000-01-01 00:00:01 4
10 3 2000-01-01 02:00:00 5
What method would I use to accomplish this?
The size of the table is ~20 million rows, if that affects the method used to solve the problem.
I'm not an Oracle wiz, so I'm guessing at the best option for one line:
ROUND((act_dt - DATE '2000-01-01') * 24 * 60 * 60) AS time_id,
This just needs to be "the number of seconds from [aDateConstant] to act_dt". The result can be negative. It just needs to be the number of seconds, to turn your act_dt into an INT that grows by 1 for each second (this assumes act_dt is a DATE, so the subtraction yields a number of days). Within a run of consecutive seconds, time_id - sequence_id stays constant, which is what identifies a group. The rest should work fine.
WITH
sequenced_data
AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY prd_id ORDER BY act_dt) AS sequence_id,
-- seconds since a fixed date; assumes act_dt is a DATE
ROUND((act_dt - DATE '2000-01-01') * 24 * 60 * 60) AS time_id,
t.*
FROM
yourTable t
)
SELECT
DENSE_RANK() OVER (PARTITION BY prd_id ORDER BY time_id - sequence_id) AS group_id,
s.*
FROM
sequenced_data s
Example data:
sequence_id | time_id | t-s | group_id
-------------+---------+-----+----------
1 | 1 | 0 | 1
2 | 2 | 0 | 1
3 | 3 | 0 | 1
4 | 8 | 4 | 2
5 | 9 | 4 | 2
6 | 12 | 6 | 3
7 | 14 | 7 | 4
8 | 15 | 7 | 4
NOTE: This does assume there are not multiple records with the same time. If there are, they would need to be filtered out first. Probably just using a GROUP BY in a preceding CTE.
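The time_id minus sequence_id trick can be tried out in Python on the question's sample rows. This is a sketch under a few assumptions: assign_groups is an illustrative name, the epoch constant is arbitrary, and the groups are numbered globally in order of appearance (to match the grp column the question expects) rather than per product as DENSE_RANK with PARTITION BY would.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
EPOCH = datetime(2000, 1, 1)  # arbitrary constant; only differences matter

# Sample rows from the question: (id, prd_id, act_dt)
activity = [
    (1, 1, "2000-01-01 00:00:00"), (2, 1, "2000-01-01 00:00:01"),
    (3, 1, "2000-01-01 00:00:02"), (4, 2, "2000-01-01 00:00:00"),
    (5, 2, "2000-01-01 00:00:01"), (6, 2, "2000-01-01 01:00:00"),
    (7, 2, "2000-01-01 01:00:01"), (8, 3, "2000-01-01 00:00:00"),
    (9, 3, "2000-01-01 00:00:01"), (10, 3, "2000-01-01 02:00:00"),
]


def assign_groups(rows):
    """Return {id: grp}; rows one second apart within a product share a grp."""
    parsed = sorted((prd, datetime.strptime(ts, FMT), id_) for id_, prd, ts in rows)
    group_of_key = {}
    grp_by_id = {}
    seq, prev_prd = 0, None
    for prd, ts, id_ in parsed:
        # sequence_id restarts at 1 for every product (ROW_NUMBER per prd_id)
        seq = seq + 1 if prd == prev_prd else 1
        prev_prd = prd
        time_id = int((ts - EPOCH).total_seconds())
        # Constant within a run of consecutive seconds
        key = (prd, time_id - seq)
        grp_by_id[id_] = group_of_key.setdefault(key, len(group_of_key) + 1)
    return grp_by_id


groups = assign_groups(activity)
print([groups[i] for i in range(1, 11)])
```

As in the SQL version, duplicate timestamps within a product would break the runs and need to be collapsed first.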

How to insert additional values in between a GROUP BY

I am currently making a monthly report using MySQL. I have a table named "monthly" that looks something like this:
id | date | amount
10 | 2009-12-01 22:10:08 | 7
9 | 2009-11-01 22:10:08 | 78
8 | 2009-10-01 23:10:08 | 5
7 | 2009-07-01 21:10:08 | 54
6 | 2009-03-01 04:10:08 | 3
5 | 2009-02-01 09:10:08 | 456
4 | 2009-02-01 14:10:08 | 4
3 | 2009-01-01 20:10:08 | 20
2 | 2009-01-01 13:10:15 | 10
1 | 2008-12-01 10:10:10 | 5
Then, when I make a monthly report (grouped by month within each year), I get something like this:
yearmonth | total
2008-12 | 5
2009-01 | 30
2009-02 | 460
2009-03 | 3
2009-07 | 54
2009-10 | 5
2009-11 | 78
2009-12 | 7
I used this query to achieve the result:
SELECT substring( date, 1, 7 ) AS yearmonth, sum( amount ) AS total
FROM monthly
GROUP BY substring( date, 1, 7 )
But I need something like this:
yearmonth | total
2008-01 | 0
2008-02 | 0
2008-03 | 0
2008-04 | 0
2008-05 | 0
2008-06 | 0
2008-07 | 0
2008-08 | 0
2008-09 | 0
2008-10 | 0
2008-11 | 0
2008-12 | 5
2009-01 | 30
2009-02 | 460
2009-03 | 3
2009-04 | 0
2009-05 | 0
2009-06 | 0
2009-07 | 54
2009-08 | 0
2009-09 | 0
2009-10 | 5
2009-11 | 78
2009-12 | 7
Something that would display zeroes for the months that don't have any value. Is it even possible to do that in a MySQL query?
You should generate a dummy rowsource and LEFT JOIN with it:
SELECT year, month, COALESCE(SUM(m.amount), 0) AS total
FROM (
SELECT 1 AS month
UNION ALL
SELECT 2
…
UNION ALL
SELECT 12
) months
CROSS JOIN
(
SELECT 2008 AS year
UNION ALL
SELECT 2009 AS year
) years
LEFT JOIN
mydata m
ON m.date >= CONCAT_WS('.', year, month, 1)
AND m.date < CONCAT_WS('.', year, month, 1) + INTERVAL 1 MONTH
GROUP BY
year, month
You can create these as tables on disk rather than generate them each time.
MySQL is the only system of the major four that does not have an easy way to generate arbitrary resultsets.
Oracle, SQL Server and PostgreSQL do have one (CONNECT BY, recursive CTEs and generate_series, respectively).
Quassnoi is right, and I'll add a comment about how to recognize when you need something like this:
You want '2008-01' in your result, yet nothing in the source table has a date in January, 2008. Result sets have to come from the tables you query, so the obvious conclusion is that you need an additional table - one that contains each month you want as part of your result.
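The "extra table of months" idea can be illustrated outside the database with a short Python sketch on the question's sample rows. Here the list of months plays the role of the generated rowsource and the dictionary lookup with a default of 0 plays the role of the LEFT JOIN. The name monthly_totals and the choice of years are assumptions made for this example.

```python
from collections import defaultdict

# Sample rows from the question: (id, date, amount)
monthly = [
    (10, "2009-12-01 22:10:08", 7),
    (9, "2009-11-01 22:10:08", 78),
    (8, "2009-10-01 23:10:08", 5),
    (7, "2009-07-01 21:10:08", 54),
    (6, "2009-03-01 04:10:08", 3),
    (5, "2009-02-01 09:10:08", 456),
    (4, "2009-02-01 14:10:08", 4),
    (3, "2009-01-01 20:10:08", 20),
    (2, "2009-01-01 13:10:15", 10),
    (1, "2008-12-01 10:10:10", 5),
]


def monthly_totals(rows, years):
    """Return [(yearmonth, total)] for every month of the given years."""
    totals = defaultdict(int)
    for _id, ts, amount in rows:
        totals[ts[:7]] += amount  # same key as substring(date, 1, 7)
    # The generated "calendar" of months, independent of the data
    calendar = [f"{year}-{month:02d}" for year in years for month in range(1, 13)]
    # Months without data fall back to 0, like LEFT JOIN plus COALESCE
    return [(ym, totals.get(ym, 0)) for ym in calendar]


report = monthly_totals(monthly, [2008, 2009])
for yearmonth, total in report:
    print(yearmonth, total)
```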