I'm trying to GROUP BY to count events over weeks in Hive. What I'd like to get out is the date for each Saturday of the year (the output only needs to return results for weeks where we have data) and the number of events that occurred over the entire preceding week (i.e., the num_events column should be the total number of events from Sunday through Saturday).
Example Desired Output:
+------------+------------+
| ymd_date | num_events |
+------------+------------+
| 2016-01-09 | 42 |
| 2016-01-16 | 500 |
| 2016-01-23 | 1090 |
| . | . |
| . | . |
| . | . |
| 2016-12-31 | 23125 |
+------------+------------+
But I'm not sure how to convert from WEEKOFYEAR to get the date for each Saturday.
What I Have So Far:
SELECT
concat_ws('-', cast(YEAR(FROM_UNIXTIME(time))as string),
lpad(cast(MONTH(FROM_UNIXTIME(time))as string), 2, '0'),
cast(WEEKOFYEAR(FROM_UNIXTIME(time))as string)) as ymd_date,
COUNT(*) as num_events
FROM
mytable
GROUP BY
concat_ws('-', cast(YEAR(FROM_UNIXTIME(time))as string),
lpad(cast(MONTH(FROM_UNIXTIME(time))as string), 2, '0'),
cast(WEEKOFYEAR(FROM_UNIXTIME(time))as string))
ORDER BY
ymd_date
Example Current Output:
+------------+------------+
| ymd_date | num_events |
+------------+------------+
| 2016-01-1 | 42 |
| 2016-01-2 | 500 |
| 2016-01-3 | 1090 |
| . | . |
| . | . |
| . | . |
| 2016-12-52 | 23125 |
+------------+------------+
I think what I have so far is just about there, but the date (the ymd_date column) shows the year-month-weekofyear instead of year-month-day.
Any ideas on how to produce the yyyy-mm-dd for each week?
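Step forward to the next Saturday with next_day, then back one week with date_sub:
date_sub(next_day(from_unixtime (time),'SAT'),7)
Both functions are documented under Hive Operators and User-Defined Functions (UDFs).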
select date_sub(next_day(from_unixtime(time),'SAT'),7) as ymd_date
,count(*) as num_events
from mytable
group by date_sub(next_day(from_unixtime(time),'SAT'),7)
order by ymd_date
hive> select date_sub(next_day(from_unixtime(unix_timestamp()),'SAT'),7);
OK
2016-12-17
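The date arithmetic behind that expression can be checked outside Hive. The sketch below is plain Python, not Hive, and the helper names are made up for illustration. It relies on next_day returning the first Saturday strictly after its argument, so the expression above lands on the Saturday on or before each event, which labels Saturday-through-Friday buckets. If the label should instead be the Saturday that closes a Sunday-through-Saturday week, as the question describes, the second helper shows the arithmetic; an untested Hive equivalent would presumably be next_day(date_sub(from_unixtime(time), 1), 'SAT').

```python
from datetime import date, timedelta

SATURDAY = 5  # date.weekday(): Monday is 0, so Saturday is 5


def saturday_on_or_before(d):
    """Same arithmetic as date_sub(next_day(d, 'SAT'), 7)."""
    return d - timedelta(days=(d.weekday() - SATURDAY) % 7)


def saturday_on_or_after(d):
    """Saturday that closes the Sunday-through-Saturday week containing d."""
    return d + timedelta(days=(SATURDAY - d.weekday()) % 7)


sunday = date(2016, 1, 3)
print(saturday_on_or_before(sunday))  # 2016-01-02
print(saturday_on_or_after(sunday))   # 2016-01-09
```

Both helpers return the input unchanged when it is already a Saturday, so they differ only for the other six days of the week.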
So after looking at what seems to be a common question being asked and not being able to get any solution to work for me, I decided I should ask for myself.
I have a data set with two columns: session_start_time, uid
I am trying to generate a rolling 30 day tally of unique sessions
It is simple enough to query for the number of unique uids per day:
SELECT
COUNT(DISTINCT(uid))
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '30 days'
It is also relatively simple to calculate the daily unique uids over a date range:
SELECT
DATE_TRUNC('day',session_start_time) AS "date"
,COUNT(DISTINCT uid) AS "count"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY date(session_start_time)
I then tried several ways to do a rolling 30-day unique count over a time interval:
SELECT
DATE(session_start_time) AS "running30day"
,COUNT(distinct(
case when date(session_start_time) >= running30day - interval '30 days'
AND date(session_start_time) <= running30day
then uid
end)
) AS "unique_30day"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '3 months'
GROUP BY date(session_start_time)
Order BY running30day desc
I really thought this would work, but when looking into the results, it appears I'm getting the same numbers as the daily unique count rather than the unique count over 30 days.
I am writing this query in Metabase using the SQL query editor. The underlying tables are in Redshift.
If you read this far, thank you, your time has value and I appreciate the fact that you have spent some of it to read my question.
EDIT:
As rightfully requested, I added an example of the data set I'm working with and the desired outcome.
+-----+-------------------------------+
| UID | SESSION_START_TIME |
+-----+-------------------------------+
| 10 | 2020-01-13T01:46:07.000-05:00 |
| 5 | 2020-01-13T01:46:07.000-05:00 |
| 3 | 2020-01-18T02:49:23.000-05:00 |
| 9 | 2020-03-06T18:18:28.000-05:00 |
| 2 | 2020-03-06T18:18:28.000-05:00 |
| 8 | 2020-03-31T23:13:33.000-04:00 |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| 2 | 2020-08-28T18:23:15.000-04:00 |
| 9 | 2020-08-28T18:23:15.000-04:00 |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| 8 | 2020-09-15T16:40:29.000-04:00 |
| 3 | 2020-09-21T20:49:09.000-04:00 |
| 1 | 2020-11-05T21:31:48.000-05:00 |
| 6 | 2020-11-05T21:31:48.000-05:00 |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| 5 | 2020-12-12T04:42:00.000-05:00 |
+-----+-------------------------------+
Below is what I would like the result to look like:
+------------+---------------------+
| DATE | UNIQUE 30 DAY COUNT |
+------------+---------------------+
| 2020-01-13 | 3 |
| 2020-01-18 | 1 |
| 2020-03-06 | 3 |
| 2020-03-31 | 1 |
| 2020-08-28 | 4 |
| 2020-09-15 | 2 |
| 2020-09-21 | 1 |
| 2020-11-05 | 2 |
| 2020-12-12 | 2 |
+------------+---------------------+
Thank you
You can approach this by keeping a counter of when users are counted and then uncounted -- 30 (or perhaps 31) days later. Then, determine the "islands" of being counted, and aggregate. This involves:
Unpivoting the data to have an "enters count" and "leaves" count for each session.
Accumulate the count so on each day for each user you know whether they are counted or not.
This defines "islands" of counting. Determine where the islands start and stop -- getting rid of all the detritus in-between.
Now you can simply do a cumulative sum on each date to determine how many users fall inside the 30-day window.
In SQL, this looks like:
with t as (
select uid, date_trunc('day', session_start_time) as s_day, 1 as inc
from users_sessions
union all
select uid, date_trunc('day', session_start_time) + interval '31 day' as s_day, -1
from users_sessions
),
tt as ( -- increment the ins and outs to determine whether a uid is in or out on a given day
select uid, s_day, sum(inc) as day_inc,
sum(sum(inc)) over (partition by uid order by s_day rows between unbounded preceding and current row) as running_inc
from t
group by uid, s_day
),
ttt as ( -- find the beginning and end of the islands
select tt.uid, tt.s_day,
(case when running_inc > 0 then 1 else -1 end) as in_island
from (select tt.*,
lag(running_inc) over (partition by uid order by s_day) as prev_running_inc,
lead(running_inc) over (partition by uid order by s_day) as next_running_inc
from tt
) tt
where running_inc > 0 and (prev_running_inc = 0 or prev_running_inc is null) or
running_inc = 0 and (next_running_inc > 0 or next_running_inc is null)
)
select s_day,
sum(sum(in_island)) over (order by s_day rows between unbounded preceding and current row) as active_30
from ttt
group by s_day;
Here is a db<>fiddle.
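The counter idea is easier to follow outside SQL. Here is a minimal Python sketch of the same steps, assuming sessions arrive as (uid, day) pairs; the function name and the window_days parameter are illustrative and not part of the query above. A user "enters" on each session day and "leaves" window_days later, and only the transitions between "not counted" and "counted" are fed into the global running total.

```python
from collections import defaultdict
from datetime import date, timedelta


def active_users_by_day(sessions, window_days=30):
    """Return {day: number of users with a session in the trailing window}."""
    # Step 1: unpivot each session into a +1 (enters) and a -1 (leaves) event.
    per_user = defaultdict(lambda: defaultdict(int))
    for uid, day in sessions:
        per_user[uid][day] += 1
        per_user[uid][day + timedelta(days=window_days)] -= 1

    # Steps 2 and 3: accumulate per user and keep only island boundaries.
    global_delta = defaultdict(int)
    for deltas in per_user.values():
        running = 0
        for day in sorted(deltas):
            previous = running
            running += deltas[day]
            if previous == 0 and running > 0:
                global_delta[day] += 1   # island starts: user becomes counted
            elif previous > 0 and running == 0:
                global_delta[day] -= 1   # island ends: user drops out

    # Step 4: cumulative sum over all boundary days.
    active = 0
    result = {}
    for day in sorted(global_delta):
        active += global_delta[day]
        result[day] = active
    return result


sessions = [(1, date(2020, 1, 1)), (1, date(2020, 1, 15)), (2, date(2020, 1, 10))]
for day, count in active_users_by_day(sessions).items():
    print(day, count)
```

With window_days=30 a user stays counted for 30 calendar days starting on the session day; the SQL above uses a 31-day offset, so adjust the parameter to match whichever convention you need.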
I'm pretty sure the easier way to do this is to use a join. This creates a list of all the distinct users who had a session on each day and a list of all distinct dates in the data. Then it one-to-many joins the user list to the date list and counts the distinct users. The key here is the expanded join criteria, which match a range of dates to a single date via a pair of inequalities.
with users as (
    select distinct
        uid,
        date_trunc('day', session_start_time) as dt
    from <table>
    where session_start_time >= '2021-05-01'
),
dates as (
    select distinct date_trunc('day', session_start_time) as dt
    from <table>
    where session_start_time >= '2021-05-01'
)
select
    count(distinct uid),
    dates.dt
from users
join dates
    -- date_trunc returns a timestamp, so subtract an interval rather than an integer
    on users.dt >= dates.dt - interval '29 days'
    and users.dt <= dates.dt
group by dates.dt
order by dates.dt desc
;
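To make the range join concrete, here is a small Python sketch of what it computes, assuming sessions are (uid, day) pairs; the function name and the sample rows are illustrative only. For every distinct day it collects the distinct uids whose session day falls in the 30-day range ending on that day.

```python
from datetime import date, timedelta


def rolling_unique(sessions, window_days=30):
    """Return {day: distinct uids seen in the window_days ending on that day}."""
    user_days = set(sessions)                          # "users" CTE: distinct (uid, day)
    all_days = sorted({day for _, day in user_days})   # "dates" CTE: distinct days
    counts = {}
    for d in all_days:
        start = d - timedelta(days=window_days - 1)    # dates.dt - 29 for a 30-day window
        counts[d] = len({uid for uid, day in user_days if start <= day <= d})
    return counts


sessions = [
    (10, date(2020, 1, 13)),
    (5, date(2020, 1, 13)),
    (3, date(2020, 1, 18)),
    (9, date(2020, 3, 6)),
]
print(rolling_unique(sessions))
```

Like the SQL, this only produces rows for days that have at least one session; days with no sessions would need a generated calendar to appear in the output.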
I'm working on the following Presto SQL query, which uses an inline filter to get a side-by-side comparison of the current date range against the same range one week earlier.
In my case the current date range is 2017-09-13 to 2017-09-14.
So far I'm able to get the following results, but unfortunately this is not what I want.
Any kind of help would be greatly appreciated.
SELECT
DATE_TRUNC('day',DATE_PARSE(CAST(sample.datep AS VARCHAR),'%Y%m%d')) AS date,
CAST(SUM(sample.page_views) FILTER (WHERE sample.datep BETWEEN 20170913 AND 20170914) AS DOUBLE) AS page_views,
CAST(SUM(sample.page_views) FILTER (WHERE sample.datep BETWEEN 20170906 AND 20170907) AS DOUBLE) AS page_views_weeks_ago
FROM
sample
WHERE
(
datep BETWEEN 20170906 AND 20170914
)
GROUP BY
1
ORDER BY
1 ASC
LIMIT 50
Actual result:
+------------+------------+----------------------+
| date | page_views | page_views_weeks_ago |
+------------+------------+----------------------+
| 2017-09-06 | 0 | 990,929 |
| 2017-09-07 | 0 | 913,802 |
| 2017-09-08 | 0 | 0 |
| 2017-09-09 | 0 | 0 |
| 2017-09-10 | 0 | 0 |
| 2017-09-11 | 0 | 0 |
| 2017-09-12 | 0 | 0 |
| 2017-09-13 | 1,507,715 | 0 |
| 2017-09-14 | 48,625 | 0 |
+------------+------------+----------------------+
Expected result:
+------------+------------+----------------------+
| date | page_views | page_views_weeks_ago |
+------------+------------+----------------------+
| 2017-09-13 | 1,507,715 | 990,929 |
| 2017-09-14 | 48,625 | 913,802 |
+------------+------------+----------------------+
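You can achieve this by joining the table with itself, pairing each day with the same day one week earlier. For brevity, I assume there is a proper date column named date so that the date subtraction is easy, and that the table holds one row per day (otherwise aggregate per day before joining, or the join will multiply rows and inflate the sums).
SELECT curr.date,
SUM(curr.page_views) AS page_views,
SUM(prev.page_views) AS page_views_weeks_ago
FROM sample curr
JOIN sample prev ON curr.date - INTERVAL '7' DAY = prev.date
WHERE curr.date BETWEEN DATE '2017-09-13' AND DATE '2017-09-14'
GROUP BY 1
ORDER BY 1 ASC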
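The pairing logic of the self-join can be sketched in Python using the daily totals from the question. The function name is made up for illustration, and the dictionary stands in for a table with one row per day.

```python
from datetime import date, timedelta

# Daily totals taken from the "Actual result" table in the question.
page_views = {
    date(2017, 9, 6): 990929,
    date(2017, 9, 7): 913802,
    date(2017, 9, 13): 1507715,
    date(2017, 9, 14): 48625,
}


def compare_with_week_ago(views, start, end):
    """Pair each day in [start, end] with the day exactly 7 days earlier."""
    rows = []
    day = start
    while day <= end:
        week_ago = day - timedelta(days=7)
        # Inner-join behaviour: skip days where either side is missing.
        if day in views and week_ago in views:
            rows.append((day, views[day], views[week_ago]))
        day += timedelta(days=1)
    return rows


for row in compare_with_week_ago(page_views, date(2017, 9, 13), date(2017, 9, 14)):
    print(row)
```

This yields one row per current day with both values side by side, which is the shape of the expected result.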
I have a table like the next one:
+------------+---------+---------+---------+
| date | value 1 | value 2 | value 3 |
+------------+---------+---------+---------+
| 01/01/2017 | 263 | 7 | 222 |
| 02/01/2017 | 275 | -9 | 209 |
| 03/01/2017 | 331 | -9 | 243 |
| . | . | . | . |
| . | . | . | . |
| . | . | . | . |
+------------+---------+---------+---------+
I want to create this other one in postgres:
+---------+---------------+------------+------------+
| | 01/01/2017 | 02/01/2017 | 03/01/2017 |
+---------+---------------+------------+------------+
| value 1 | 263 | 275 | 331 |
| value 2 | 7 | -9 | -9 |
| value 3 | 222 | 209 | 243 |
+---------+---------------+------------+------------+
But my problem is that I don't know how many dates I will have, so I have to use something like this:
SELECT * FROM crosstab(
$$ SELECT value1, date FROM myTable ORDER BY 1 $$,
$$ SELECT m FROM generate_series((select min(date) from myTable) ,(select max(date) from myTable), '1 month'::interval) m $$
) AS (
".." date, ".." date, ".." date, ".." date
);
Can someone help me? Thanks.
Your basic issue is that PostgreSQL needs to know what the columns look like in order to plan the query. Consequently you need to return some sort of fixed-column structure. There are a number of ways you can do this:
Query the dates first, or allow them to be input, and then generate your query in the db client.
Wrap this in a stored procedure which returns a refcursor.
Wrap this in a stored procedure which returns a list of JSON representations of rows.
Whichever you choose, you cannot do it in a query without dynamically generating the query somewhere.
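As a rough illustration of the first option, the client can fetch the distinct dates and then build the pivot query text from them. The sketch below is Python and only assembles a SQL string; build_pivot_sql is a hypothetical helper, the column names value1 and value2 are assumed, and it uses conditional aggregation instead of crosstab to keep the generated text simple. Identifiers are interpolated directly, so this is only reasonable for names you control.

```python
from datetime import date


def build_pivot_sql(dates, value_columns, table="myTable"):
    """Build one SELECT per value column, with one output column per date."""
    selects = []
    for col in value_columns:
        cells = ", ".join(
            f"MAX(CASE WHEN date = DATE '{d.isoformat()}' THEN {col} END) "
            f'AS "{d.isoformat()}"'
            for d in dates
        )
        selects.append(f"SELECT '{col}' AS name, {cells} FROM {table}")
    return "\nUNION ALL\n".join(selects)


# In practice the dates would come from: SELECT DISTINCT date FROM myTable ORDER BY 1
dates = [date(2017, 1, 1), date(2017, 1, 2)]
print(build_pivot_sql(dates, ["value1", "value2"]))
```

The generated statement has a fixed column list by the time PostgreSQL sees it, which is what the planner needs.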
I'm attempting to combine multiple rows using a UNION but I need to pull in additional data as well. My thought was to use a UNION in the outer query but I can't seem to make it work. Or am I going about this all wrong?
The data I have is like this:
+------+------+-------+---------+---------+
| ID | Time | Total | Weekday | Weekend |
+------+------+-------+---------+---------+
| 1001 | AM | 5 | 5 | 0 |
| 1001 | AM | 2 | 0 | 2 |
| 1001 | AM | 4 | 1 | 3 |
| 1001 | AM | 5 | 3 | 2 |
| 1001 | PM | 5 | 3 | 2 |
| 1001 | PM | 5 | 5 | 0 |
| 1002 | PM | 4 | 2 | 2 |
| 1002 | PM | 3 | 3 | 0 |
| 1002 | PM | 1 | 0 | 1 |
+------+------+-------+---------+---------+
What I want to see is like this:
+------+---------+------+-------+
| ID | DayType | Time | Tasks |
+------+---------+------+-------+
| 1001 | Weekday | AM | 9 |
| 1001 | Weekend | AM | 7 |
| 1001 | Weekday | PM | 8 |
| 1001 | Weekend | PM | 2 |
| 1002 | Weekday | PM | 5 |
| 1002 | Weekend | PM | 3 |
+------+---------+------+-------+
The closest I've come so far is using a UNION statement like the following:
SELECT * FROM
(
SELECT Weekday, 'Weekday' as 'DayType' FROM t1
UNION
SELECT Weekend, 'Weekend' as 'DayType' FROM t1
) AS X
Which results in something like the following:
+---------+---------+
| Weekday | DayType |
+---------+---------+
| 2 | Weekend |
| 0 | Weekday |
| 2 | Weekday |
| 0 | Weekend |
| 10 | Weekday |
+---------+---------+
I don't see any rhyme or reason as to what the numbers are under the 'Weekday' column, I suspect they're being grouped somehow. And of course there are several other columns missing, but since I can't put a large scope in the outer query with this as inner one, I can't figure out how to pull those in. Help is greatly appreciated.
It looks like you want to union all a pair of aggregation queries that use sum() and group by id, time, one for Weekday and one for Weekend:
select Id, DayType = 'Weekend', [time], Tasks=sum(Weekend)
from t
group by id, [time]
union all
select Id, DayType = 'Weekday', [time], Tasks=sum(Weekday)
from t
group by id, [time]
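To check that this produces the numbers in the desired output, here is the same unpivot-and-sum logic as a Python sketch over the sample rows from the question (the Total column is omitted because it is not needed; the function name is illustrative).

```python
from collections import defaultdict

# (ID, Time, Weekday, Weekend) rows copied from the question's sample data.
rows = [
    (1001, "AM", 5, 0),
    (1001, "AM", 0, 2),
    (1001, "AM", 1, 3),
    (1001, "AM", 3, 2),
    (1001, "PM", 3, 2),
    (1001, "PM", 5, 0),
    (1002, "PM", 2, 2),
    (1002, "PM", 3, 0),
    (1002, "PM", 0, 1),
]


def tasks_by_day_type(rows):
    """Sum Weekday and Weekend separately per (ID, Time), one entry per day type."""
    totals = defaultdict(int)
    for id_, time, weekday, weekend in rows:
        totals[(id_, "Weekday", time)] += weekday   # first half of the UNION ALL
        totals[(id_, "Weekend", time)] += weekend   # second half of the UNION ALL
    return dict(totals)


for key, tasks in sorted(tasks_by_day_type(rows).items(), key=lambda kv: (kv[0][0], kv[0][2], kv[0][1])):
    print(*key, tasks)
```

Each source row contributes to exactly two output groups, which is why the SQL needs UNION ALL over two aggregations and not a plain UNION, which would also drop duplicate rows.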
Try this:
select ID, 'Weekday' as DayType, Time, sum(Weekday) as Tasks
from t1
group by ID, Time
union all
select ID, 'Weekend', Time, sum(Weekend)
from t1
group by ID, Time
order by 1, 3, 2
Not tested, but it should do the trick. It may require 2 proc sql steps for the calculation, one for summing and one for the case when statements. If you have extra lines, just use a max statement and group by ID, Time, type_day.
Proc sql; create table want as select ID, Time,
sum(weekday) as weekdayTask,
sum(weekend) as weekendTask,
case when calculated weekdayTask>0 then calculated weekdayTask
when calculated weekendTask>0 then calculated weekendTask else .
end as Task,
case when calculated weekdaytask>0 then "Weekday"
when calculated weekendtask>0 then "Weekend"
end as Day_Type
from have
group by ID, Time
;quit;
Proc sql; create table want2 as select ID, Time, Day_Type, Task
from want
;quit;
I have a table of events called event. For the purpose of this question it only has one field called date.
The following query returns me a number of events that are happening on each date for the next 14 days:
SELECT
DATE_FORMAT( ev.date, '%Y-%m-%d' ) as short_date,
count(*) as date_count
FROM event ev
WHERE ev.date >= NOW()
GROUP BY short_date
ORDER BY ev.start_date ASC
LIMIT 14
The result could be as follows:
+------------+------------+
| short_date | date_count |
+------------+------------+
| 2010-03-14 | 1 |
| 2010-03-15 | 2 |
| 2010-03-16 | 9 |
| 2010-03-17 | 8 |
| 2010-03-18 | 11 |
| 2010-03-19 | 14 |
| 2010-03-20 | 13 |
| 2010-03-21 | 7 |
| 2010-03-22 | 2 |
| 2010-03-23 | 3 |
| 2010-03-24 | 3 |
| 2010-03-25 | 6 |
| 2010-03-26 | 23 |
| 2010-03-27 | 14 |
+------------+------------+
14 rows in set (0.06 sec)
Let's say I want to display these events by date. At the same time I only want to display a maximum of 10 at a time. How would I do this?
Somehow I need to limit this result by the SUM of the date_count field but I do not know how.
Anybody run into this problem before?
Any help would be appreciated. Thanks
Edited:
The extra requirement (crucial one, oops) which I forgot in my original post, is that I only want whole days.
i.e., given the limit is 10, it would only return the following rows:
+------------+------------+
| short_date | date_count |
+------------+------------+
| 2010-03-14 | 1 |
| 2010-03-15 | 2 |
| 2010-03-16 | 9 |
+------------+------------+
Use a date function to limit the range to the next 14 days.
Use LIMIT to display the first 10 rows.
SELECT
DATE_FORMAT( ev.date, '%Y-%m-%d' ) as short_date,
count(*) as date_count
FROM event ev
WHERE ev.date between NOW() and date_add(now(), interval 14 day)
GROUP BY date(short_date)
ORDER BY ev.start_date ASC
LIMIT 0,10
I think that using LIMIT 0, 10 will work for you.
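Note that LIMIT caps the number of rows (days), not the number of events. For the whole-days requirement in the edit, one possible reading of the example is: keep adding whole days until the running total of date_count first reaches the limit. A minimal Python sketch of that reading, applied to the grouped result on the client side (the function name is illustrative, and the cut-off rule is an assumption inferred from the example output):

```python
def whole_days_up_to(day_counts, limit):
    """Keep whole days until the running event total first reaches the limit."""
    selected = []
    running_total = 0
    for day, count in day_counts:
        selected.append((day, count))
        running_total += count
        if running_total >= limit:
            break
    return selected


day_counts = [
    ("2010-03-14", 1),
    ("2010-03-15", 2),
    ("2010-03-16", 9),
    ("2010-03-17", 8),
]
print(whole_days_up_to(day_counts, 10))
# [('2010-03-14', 1), ('2010-03-15', 2), ('2010-03-16', 9)]
```

If the limit must never be exceeded, move the check before the append and stop as soon as adding the next day would push the total past the limit.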