Aggregate a variable from a timestamp in BigQuery - SQL

I am planning to calculate the most frequent part_of_day for each user. First I encode the timestamp as a part_of_day, then I aggregate to find the most frequent part_of_day. I use ARRAY_AGG to calculate the mode. However, I'm not sure how to handle the timestamp inside ARRAY_AGG, because I get an error, so my code structure might be wrong:
SELECT User_ID, time,
ARRAY_AGG(Time ORDER BY cnt DESC LIMIT 1)[OFFSET(0)] part_of_day,
case
when time BETWEEN '04:00:00' AND '12:00:00'
then "morning"
when time < '04:00:00' OR time > '20:00:00'
then "night"
end AS part_of_day
FROM (
SELECT User_ID,
TIME_TRUNC(TIME(Request_Timestamp), SECOND) AS Time
COUNT(*) AS cnt
Error received:
Syntax error: Expected ")" but got identifier "COUNT" at [19:9]

Even though you did not share any sample data, I was able to identify some issues in your code.
I created some sample data based on the formats and functions you used in your code, to keep things consistent. Below is the code, now without any errors:
WITH data AS (
  SELECT 98 AS User_ID, DATETIME "2008-12-25 05:30:00.000000" AS Request_Timestamp, "something!" AS channel UNION ALL
  SELECT 99 AS User_ID, DATETIME "2008-12-25 22:30:00.000000" AS Request_Timestamp, "something!" AS channel
)
SELECT User_ID, time,
       ARRAY_AGG(time ORDER BY cnt DESC LIMIT 1)[OFFSET(0)] AS part_of_day1,
       CASE
         WHEN time BETWEEN '04:00:00' AND '12:00:00' THEN "morning"
         WHEN time < '04:00:00' OR time > '20:00:00' THEN "night"
       END AS part_of_day
FROM (
  SELECT User_ID,
         TIME_TRUNC(TIME(Request_Timestamp), SECOND) AS time,
         COUNT(*) AS cnt
  FROM data
  GROUP BY User_ID, channel, Request_Timestamp
  #order by Request_Timestamp
)
GROUP BY User_ID, time;
First, notice that I changed the column name in your ARRAY_AGG() expression; that had to be done because the duplicate alias would cause a "Duplicate column name" error. Second, a comma was missing after your TIME_TRUNC() function, so COUNT(*) could not be selected. Then, within your inner GROUP BY, you needed to group by Request_Timestamp as well, because it was neither aggregated nor grouped. Lastly, in your outer GROUP BY, you needed to aggregate or group time. After these corrections, your code executes without any errors.
Note: the Syntax error: Expected ")" but got identifier "COUNT" at [19:9] error you experienced is due to the missing comma. The other issues only surface once this one is corrected.

If you want the most frequent part of day for each user, you need to use the day part in the aggregation:
SELECT User_ID,
       ARRAY_AGG(part_of_day ORDER BY cnt DESC LIMIT 1)[OFFSET(0)] AS part_of_day
FROM (SELECT User_ID,
             (CASE WHEN time BETWEEN '04:00:00' AND '12:00:00' THEN 'morning'
                   WHEN time < '04:00:00' OR time > '20:00:00' THEN 'night'
              END) AS part_of_day,
             COUNT(*) AS cnt
      FROM cognitivebot2.chitchaxETL.conversations
      GROUP BY User_ID, part_of_day
     ) u
GROUP BY User_ID;
Obviously, if you want the channel as well, then you need to include that in the queries.
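For illustration, a minimal sketch of that variant, assuming the conversations table also has a channel column (the column name is an assumption carried over from the sample data above):
SELECT User_ID, channel,
       ARRAY_AGG(part_of_day ORDER BY cnt DESC LIMIT 1)[OFFSET(0)] AS part_of_day
FROM (SELECT User_ID, channel,  -- channel is assumed to exist in the table
             (CASE WHEN time BETWEEN '04:00:00' AND '12:00:00' THEN 'morning'
                   WHEN time < '04:00:00' OR time > '20:00:00' THEN 'night'
              END) AS part_of_day,
             COUNT(*) AS cnt
      FROM cognitivebot2.chitchaxETL.conversations
      GROUP BY User_ID, channel, part_of_day
     ) u
GROUP BY User_ID, channel;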


Select rows with a date condition in Presto

I am trying to select, by hour, the number of impressions for a particular day.
I tried with this code:
SELECT
date_trunc('hour', CAST(date_time AS timestamp)) date_time,
COUNT(impression_id) AS count_impression_id
FROM
parquet_db.imp_pixel
WHERE
date_time = '2022-07-27'
LIMIT 100
GROUP BY 1
But I got this error when I added the "where" clause:
line 5:1: mismatched input 'group'. Expecting:
Can you help me fix it? Thanks
LIMIT usually comes last in a SQL query. Also, you should not be using LIMIT without ORDER BY. Use this version:
SELECT DATE_TRUNC('hour', CAST(date_time AS timestamp)) date_time,
COUNT(impression_id) AS count_impression_id
FROM parquet_db.imp_pixel
WHERE CAST(date_time AS date) = DATE '2022-07-27'
GROUP BY 1
ORDER BY <something>
LIMIT 100;
Note that the ORDER BY clause determines which 100 records you get in the result set. Your current (intended) query lets Presto decide on its own which 100 records get returned.
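For example, one concrete (and purely illustrative) choice is to order by the count so that the busiest hours come first; ordering by the hour itself would instead return the result chronologically:
SELECT DATE_TRUNC('hour', CAST(date_time AS timestamp)) date_time,
       COUNT(impression_id) AS count_impression_id
FROM parquet_db.imp_pixel
WHERE CAST(date_time AS date) = DATE '2022-07-27'
GROUP BY 1
ORDER BY count_impression_id DESC -- illustrative choice; pick whatever ordering matters to you
LIMIT 100;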

SQL: list MAC addresses seen more than 15 days in a month

I'm not an SQL expert, so I'm asking for your help to list the MACs that appear on more than 15 days in a month.
I made the following query, but it is very complex and most probably not efficient. Any suggestions on how to make it simpler and more efficient?
I'm using Google BigQuery, if that helps.
SELECT
macDays.macAddress AS macAddress,
macDays.days AS days
FROM (
SELECT
list_mac.macAddress AS macAddress,
COUNT( list_mac.macAddress) AS days
FROM (
SELECT
macAddress,
TIMESTAMP_TRUNC(time, DAY) date,
FROM
`my_table`
WHERE
time BETWEEN '2021-06-01 00:00:00'
AND '2021-06-30 23:59:00.000059'
GROUP BY
macAddress,
date
ORDER BY
macAddress) AS list_mac
GROUP BY
macAddress ) AS macDays
WHERE
macDays.days > 15
GROUP BY
macAddress,
days
The problem is that you are stripping off the time component from the date in your SELECT but grouping with the time portion left in, so you will get one row for every appearance rather than one for every day.
You can probably get rid of the inner subquery by using COUNT(DISTINCT field).
Try something like:
SELECT
macAddress AS macAddress,
COUNT(DISTINCT TIMESTAMP_TRUNC(time, DAY)) AS days
FROM
`my_table`
WHERE
time BETWEEN '2021-06-01 00:00:00'
AND '2021-06-30 23:59:00.000059'
GROUP BY
macAddress
HAVING
COUNT(DISTINCT TIMESTAMP_TRUNC(time, DAY)) > 15
ORDER BY
macAddress
You can do this by using a subquery. It will calculate how many times a MAC appears in a day, then pick only the MACs that appeared on more than 15 days in a month.
I have not used any filter, so you can add filters as and when needed. If you need to know how many times MACs appear in the database on a single day, you can group by dt instead. And if you want how many total MACs exist in the whole month, just remove DISTINCT.
SELECT COUNT(*) cnt,
       macAddress,
       mnth
FROM
  (SELECT DISTINCT -- this keeps only one row per MAC per day
          macAddress,
          TIMESTAMP_TRUNC(time, DAY) dt,
          TIMESTAMP_TRUNC(time, MONTH) mnth
   FROM `my_table`) q
GROUP BY macAddress, mnth
HAVING COUNT(*) > 15
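As an illustration of the per-day variant mentioned above (counting raw appearances per MAC per day, with the same assumed `my_table` columns used throughout this question), a sketch might look like:
SELECT macAddress,
       TIMESTAMP_TRUNC(time, DAY) dt,
       COUNT(*) AS appearances -- raw row count per MAC per day
FROM `my_table`
GROUP BY macAddress, dt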
I would go with HAVING, which filters on columns aggregated via GROUP BY:
select substr(time, 0, 8) yrmonth, macAddress, count(*) macDays
from my_table
where yrmonth = '2021-06'
group by macAddress, substr(time, 0, 8)
having count(*) >= 15
order by yrmonth desc
I have not tried this on Google BigQuery; here is the example on SQLite: SQL Fiddle
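For completeness, a rough BigQuery sketch of the same HAVING idea, assuming time is a TIMESTAMP column as in the other answers (it counts distinct days rather than raw rows, to match the 15-days requirement):
SELECT FORMAT_TIMESTAMP('%Y-%m', time) AS yrmonth,
       macAddress,
       COUNT(DISTINCT TIMESTAMP_TRUNC(time, DAY)) AS macDays
FROM `my_table`
WHERE FORMAT_TIMESTAMP('%Y-%m', time) = '2021-06'
GROUP BY yrmonth, macAddress
HAVING COUNT(DISTINCT TIMESTAMP_TRUNC(time, DAY)) >= 15
ORDER BY yrmonth DESC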

Grouping Consecutive Timestamps (Redshift)

Got something that I can't get my head around.
The raw data shows 15-minute intervals, and I would like to group them based on whether they are consecutive 15-minute intervals (see screenshot below). I would like to do this multiple times for each user, and for a lot of users. Any ideas on how to do this using SQL only, in a way that can scale to thousands of users?
Any help would be appreciated.
Thanks
This is a type of gaps-and-islands problem. Use lag() to get the difference, then a cumulative sum to identify the group:
select user_id, min(start_time) as start_time, max(end_time) as end_time
from (select t.*,
             -- start a new group whenever a row does not pick up where the previous one ended
             sum(case when prev_end_time = start_time then 0 else 1 end)
                 over (partition by user_id order by start_time
                       rows between unbounded preceding and current row) as grp
      from (select t.*,
                   lag(end_time) over (partition by user_id order by start_time) as prev_end_time
            from t
           ) t
     ) t
group by user_id, grp;
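A minimal, self-contained sketch of the same pattern with made-up 15-minute intervals (the table name t and the user_id/start_time/end_time columns are assumptions based on the description of the screenshot):
with t as (
    select 1 as user_id, timestamp '2021-06-01 09:00:00' as start_time, timestamp '2021-06-01 09:15:00' as end_time
    union all
    select 1, timestamp '2021-06-01 09:15:00', timestamp '2021-06-01 09:30:00'
    union all
    select 1, timestamp '2021-06-01 11:00:00', timestamp '2021-06-01 11:15:00'
)
select user_id, min(start_time) as start_time, max(end_time) as end_time
from (select t.*,
             sum(case when prev_end_time = start_time then 0 else 1 end)
                 over (partition by user_id order by start_time
                       rows between unbounded preceding and current row) as grp
      from (select t.*,
                   lag(end_time) over (partition by user_id order by start_time) as prev_end_time
            from t
           ) t
     ) t
group by user_id, grp;
-- Expected: one row covering 09:00-09:30 (two consecutive intervals collapsed)
-- and one row covering 11:00-11:15 (the gap starts a new group).
The frame clause is spelled out because Redshift expects an explicit frame when an aggregate window function has an ORDER BY; it matches the default cumulative behaviour in any case.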

Postgres SQL query by time window

I have a table "meterreading" that has the columns "timestamp", "value", and "meterId". I would like to get sums of the "value" for each hour, starting at a specific time. So far I have come up with this query, but it errors, saying I need to group by timestamp. The timestamps are just integers representing Unix epoch timestamps.
select date_trunc('hour', to_timestamp(timestamp)) as hours, sum(value)
from meterreading
WHERE timestamp >= 1377993600 AND timestamp < 1409595081
group by date_trunc('hours', to_timestamp(timestamp))
order by date_trunc('hours', to_timestamp(timestamp)) asc
select date_trunc('hour', to_timestamp(timestamp)) as hours, sum(value)
from meterreading
WHERE timestamp >= 1377993600 AND timestamp < 1409595081
group by 1
order by 1
or use the exact same expression used in the select list
group by date_trunc('hour', to_timestamp(timestamp));
Notice 'hour' instead of 'hours'. Hence the convenience of the numeric reference syntax in the GROUP BY: it is clearer and less prone to this kind of error.

Is it possible to convert this query to use a join instead of a subquery?

SELECT
number, count(id)
FROM
tracking
WHERE
id IN (SELECT max(id) FROM tracking WHERE splitnr = 'a11' AND number >0 AND timestamp >= '2009-04-08 00:00:00' AND timestamp <= '2009-04-08 12:55:57' GROUP BY ident)
GROUP BY
number
How about this:
SELECT number, count(id)
FROM tracking
INNER JOIN (SELECT max(id) AS ID
            FROM tracking
            WHERE splitnr = 'a11' AND
                  number > 0 AND
                  timestamp >= '2009-04-08 00:00:00' AND
                  timestamp <= '2009-04-08 12:55:57'
            GROUP BY ident
           ) MID ON (MID.ID = tracking.id)
GROUP BY number
Could you not do something like:
SELECT
number,
count(id)
FROM
tracking
WHERE
splitnr = 'a11' AND number > 0 AND timestamp >= '2009-04-08 00:00:00' AND timestamp <= '2009-04-08 12:55:57'
GROUP BY
number
ORDER BY
number DESC
LIMIT 0,1
(I don't really know MySQL by the way)
I'm assuming this would give you back the same result set. You order by number DESC because you want the maximum one, right? Then you can put the WHERE clause in and limit the result to one row, which gives you the first one and is essentially the same as MAX (I think), thus removing the JOIN altogether.
EDIT: I didn't think you'd need the GROUP BY ident either.
It is slightly hard to make sure I've got this entirely right without seeing the data and knowing exactly what you're trying to achieve, but personally I'd turn the sub-query into a view and then join on that, so:
create view vMaximumIDbyIdent
as
SELECT ident, max(id) maxid
FROM tracking
WHERE splitnr = 'a11' AND number >0
AND timestamp >= '2009-04-08 00:00:00'
AND timestamp <= '2009-04-08 12:55:57'
GROUP BY ident
then:
SELECT
number, count(id)
FROM
tracking,
vMaximumIDbyIdent
WHERE
tracking.id = vMaximumIDbyIdent.maxid
GROUP BY
number
More readable and maintainable.
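The same final query can also be written with explicit JOIN syntax; this is only a restatement of the query above, not a change in logic:
SELECT
    number, count(id)
FROM
    tracking
    INNER JOIN vMaximumIDbyIdent
        ON tracking.id = vMaximumIDbyIdent.maxid
GROUP BY
    number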