group by more than one column - sql

I want to select three columns from a table and group the results by two of them. I.e., if I have columns x, y, z, I want results like (x1, y1): [z1, z9, z11, ...], (x2, y2): [z3, z12, z33, ...], ...
I tried the following (Athena) query:
SELECT region, family, tenancy, platform, size_factor, duration
FROM default.sell_durations
WHERE CAST(creation_date as timestamp) BETWEEN CAST(? as timestamp) AND CAST(? as timestamp)
group by region, family, tenancy, platform, size_factor
and I got the following error:
Failed to Execute Athena query, status: FAILED, reason: SYNTAX_ERROR:
line 1:56: 'duration' must be an aggregate expression or appear in
GROUP BY clause

Use one of the aggregation functions. In this case, based on your description, array_agg seems to be the appropriate one:
SELECT region, family, tenancy, platform, size_factor, array_agg(duration)
FROM default.sell_durations
WHERE CAST(creation_date as timestamp) BETWEEN CAST(? as timestamp) AND CAST(? as timestamp)
group by region, family, tenancy, platform, size_factor
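As a self-contained illustration of how array_agg collects the non-grouped column per group, here is a sketch with made-up values (Athena/Presto accepts an inline VALUES list like this; the order of elements inside each array is not guaranteed):
SELECT x, y, array_agg(z) AS zs
FROM (VALUES
        (1, 10, 'z1'),
        (1, 10, 'z9'),
        (1, 10, 'z11'),
        (2, 20, 'z3'),
        (2, 20, 'z12')
     ) AS t (x, y, z)
GROUP BY x, y;
-- (1, 10) -> [z1, z9, z11]
-- (2, 20) -> [z3, z12]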

Related

Oracle SQL group by to_char - not a group by expression

I want to group by dd-mm-yyyy format to show working_hours per employee (person) per day, but I get the error ORA-00979: not a GROUP BY expression. When I remove TO_CHAR from the GROUP BY it works fine, but that's not what I want, since I want to group by days regardless of hours. What am I doing wrong here?
SELECT papf.person_number emp_id,
to_char(sh21.start_time,'dd/mm/yyyy') start_time,
to_char(sh21.stop_time,'dd/mm/yyyy') stop_time,
SUM(sh21.measure) working_hours
FROM per_all_people_f papf,
hwm_tm_rec sh21
WHERE ...
GROUP BY
papf.person_number,
to_char(sh21.start_time,'dd/mm/yyyy'),
to_char(sh21.stop_time,'dd/mm/yyyy')
ORDER BY sh21.start_time
ORDER BY sh21.start_time
needs to either be just the column alias defined in the SELECT clause:
ORDER BY start_time
or use the expression in the GROUP BY clause:
ORDER BY to_char(sh21.start_time,'dd/mm/yyyy')
If you use sh21.start_time, the table_alias.column_name syntax refers to the underlying column of the table, and you are not selecting or grouping by that column.
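Putting it together, a corrected version of the query could look like this (a sketch that keeps the original tables and the elided WHERE clause as-is):
SELECT papf.person_number emp_id,
       to_char(sh21.start_time,'dd/mm/yyyy') start_time,
       to_char(sh21.stop_time,'dd/mm/yyyy') stop_time,
       SUM(sh21.measure) working_hours
FROM per_all_people_f papf,
     hwm_tm_rec sh21
WHERE ...
GROUP BY papf.person_number,
         to_char(sh21.start_time,'dd/mm/yyyy'),
         to_char(sh21.stop_time,'dd/mm/yyyy')
ORDER BY to_char(sh21.start_time,'dd/mm/yyyy')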

Snowflake SQL error - Invalid argument types for function '-': (TIMESTAMP_NTZ(9), TIMESTAMP_NTZ(9))

I get this error when I try to subtract the timestamps and do a window function (lead, lag and partition by):
Invalid argument types for function '-': (TIMESTAMP_NTZ(9), TIMESTAMP_NTZ(9))
I tried date_diff, but that didn't work along with the window function:
SELECT
user_id,
event,
received_at,
received_at - LAG( received_at,1) OVER (PARTITION BY user_id ORDER BY received_at) AS last_event
FROM
segment_javascript.help_center_opened
You can't do it the "Oracle way" by just subtracting two dates to get a number; you must use a diff function with a unit/scale of measure, e.g.:
SELECT
ts,
TIMESTAMPDIFF(MILLISECONDS, LAG(ts, 1) OVER (ORDER BY ts), ts) tsd
FROM
(VALUES (CURRENT_TIMESTAMP), (DATEADD(DAY, 1, CURRENT_TIMESTAMP))) v(ts);
I suppose that you need to calculate the time difference between each event and the last event received for the same user_id.
If so, I think this would work:
SELECT
user_id,
event,
received_at,
DATEDIFF(
MINUTE, -- or any other supported date/time part
LAG( received_at,1) OVER (PARTITION BY user_id ORDER BY received_at), -- previous event time (start)
received_at -- current event time (end)
) AS last_event
FROM
segment_javascript.help_center_opened
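For a quick self-contained check, here is a sketch with made-up rows, mirroring the VALUES approach from the answer above; note that the earlier timestamp goes first so the difference comes out positive:
SELECT
    user_id,
    ts AS received_at,
    DATEDIFF(
        MINUTE,
        LAG(ts) OVER (PARTITION BY user_id ORDER BY ts),
        ts
    ) AS minutes_since_last_event
FROM (VALUES
        (1, CURRENT_TIMESTAMP),
        (1, DATEADD(MINUTE, 30, CURRENT_TIMESTAMP)),
        (2, CURRENT_TIMESTAMP)
     ) AS v (user_id, ts);
The first row for each user_id gets NULL, because LAG has no preceding row there.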

error: column "month" does not exist in PG query

My PG query:
SELECT "Tracks"."PageId",
date_trunc("month", "Tracks"."createdAt") AS month,
count(*)
FROM "Tracks"
WHERE "Tracks"."PageId" IN (29,30,31)
GROUP BY month, "Tracks"."PageId"
and my schema:
id, createdAt, updatedAt, PageId
A bit confused as to why I'm receiving this error!
You can't use an alias in the where or group by clause. You need to repeat the expression:
SELECT "Tracks"."PageId",
date_trunc('month', "Tracks"."createdAt") AS month,
count(*)
FROM "Tracks"
WHERE "Tracks"."PageId" IN (29,30,31)
GROUP BY date_trunc('month', "Tracks"."createdAt"), "Tracks"."PageId";
Note that the first parameter for date_trunc() is a varchar value, so you need to put that in single quotes, not double quotes.
If you don't want to repeat the expression you can put that into a derived table:
select "PageId", month, count(*)
from (
SELECT "Tracks"."PageId",
date_trunc('month', "Tracks"."createdAt") AS month
FROM "Tracks"
WHERE "Tracks"."PageId" IN (29,30,31)
) t
group by month, "PageId";
Unrelated, but: you should really avoid quoted identifiers. They are much more trouble than they are worth.
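For illustration only, assuming a hypothetical schema created without quotes (so all identifiers fold to lower case), the same query needs no double quotes at all:
SELECT pageid,
       date_trunc('month', createdat) AS month,
       count(*)
FROM tracks
WHERE pageid IN (29, 30, 31)
GROUP BY date_trunc('month', createdat), pageid;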

Why does casting inside a windowing function return an incorrect type?

The following sample query combining windowing functions and casting does not produce the expected result:
SELECT
visitorid,
LEAD(TIMESTAMP(date)) OVER (PARTITION BY visitorid) AS ts
FROM (
SELECT
"000001" AS visitorid,
"2014-04-28" AS date,
INTEGER(21) AS metric),
(
SELECT
"000001" AS visitorid,
"2014-04-29" AS date,
INTEGER(42) AS metric),
(
SELECT
"000002" AS visitorid,
"2014-04-28" AS date,
INTEGER(84) AS metric)
ORDER BY
visitorid ASC
Given the nature of the lead function, only the first row will contain entries for both visitorid and ts.
If we look closer at the column ts, its type is an integer instead of a timestamp. That is, I expected to see "2014-04-29 00:00:00 UTC" rather than "1398729600000000" in the first row of ts. Does this mean that casting is incompatible with windowing functions like LEAD?
As pointed out in oulenz's answer, the standard SQL dialect fixes this issue; here's an adapted version of the sample query:
SELECT
visitorid,
LEAD(CAST(date AS TIMESTAMP)) OVER (PARTITION BY visitorid ORDER BY visitorid ASC) AS ts
FROM (
SELECT
AS STRUCT "000001" AS visitorid,
"2014-04-28" AS date,
21 AS metric UNION ALL
SELECT
AS STRUCT "000001" AS visitorid,
"2014-04-29" AS date,
42 AS metric UNION ALL
SELECT
AS STRUCT "000002" AS visitorid,
"2014-04-28" AS date,
84 AS metric )
The result is as expected.
It's a known issue that timestamps get converted to integers with some window functions, although when I filed that report, lead was among the functions that worked for me.
You can convert the integer back to a timestamp in a superselect (an outer query). The issue is solved in the standard SQL dialect that was released as alpha yesterday. (It doesn't seem to be available for me yet.)
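A sketch of that workaround in the legacy dialect, re-using the sample data from the question: the integer returned by the window function is in microseconds, so an outer select can turn it back into a timestamp with USEC_TO_TIMESTAMP.
SELECT
  visitorid,
  USEC_TO_TIMESTAMP(ts) AS ts
FROM (
  SELECT
    visitorid,
    LEAD(TIMESTAMP(date)) OVER (PARTITION BY visitorid) AS ts
  FROM (
    SELECT
      "000001" AS visitorid,
      "2014-04-28" AS date,
      INTEGER(21) AS metric),
    (
    SELECT
      "000001" AS visitorid,
      "2014-04-29" AS date,
      INTEGER(42) AS metric),
    (
    SELECT
      "000002" AS visitorid,
      "2014-04-28" AS date,
      INTEGER(84) AS metric))
ORDER BY
  visitorid ASC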

Group by a generated column

I'm trying to group data by minutes, so I tried this query:
SELECT FROM_UNIXTIME(
unix_timestamp (time, 'yyyy-mm-dd hh:mm:ss'), 'yyyy-mm-dd hh:mm') as ts,
count (*) as cnt
from toucher group by ts limit 10;
Then Hive tells me there is no such column:
FAILED: SemanticException [Error 10004]: Line 1:134 Invalid table
alias or column reference 'ts': (possible column names are: time, ip,
username, code)
So is it not supported by Hive?
SELECT FROM_UNIXTIME(unix_timestamp (time, 'yyyy-mm-dd hh:mm:ss'), 'yyyy-mm-dd hh:mm') as ts,
count (*) as cnt
from toucher
group by FROM_UNIXTIME(unix_timestamp (time, 'yyyy-mm-dd hh:mm:ss'), 'yyyy-mm-dd hh:mm') limit 10;
or, better:
select t.ts, count(*) from
(SELECT FROM_UNIXTIME(unix_timestamp (time, 'yyyy-mm-dd hh:mm:ss'), 'yyyy-mm-dd hh:mm') as ts
from toucher ) t
group by t.ts limit 10;
As is the case with most relational database systems, the SELECT clause is processed after the GROUP BY clause. This means you cannot use columns aliased in the SELECT (such as ts in this example) in your GROUP BY.
There are essentially two ways around this. Both are correct, but some people have a preference for one over the other for various reasons.
First, you could group by the original expression, rather than the alias. This results in duplicate code, as you will have the exact same expression in both your SELECT and GROUP BY clause.
SELECT
FROM_UNIXTIME(unix_timestamp(time,'yyyy-mm-dd hh:mm:ss'),'yyyy-mm-dd hh:mm') as ts,
COUNT(*) as cnt
FROM toucher
GROUP BY FROM_UNIXTIME(unix_timestamp(time,'yyyy-mm-dd hh:mm:ss'),'yyyy-mm-dd hh:mm')
LIMIT 10;
A second approach is to wrap your expression and alias in a subquery. This means you do not have to duplicate your expression, but you will have two nested queries and this may have performance implications.
SELECT
ts,
COUNT(*) as cnt
FROM
(SELECT
FROM_UNIXTIME(unix_timestamp(time,'yyyy-mm-dd hh:mm:ss'),'yyyy-mm-dd hh:mm') as ts
FROM toucher) x
GROUP BY x.ts
LIMIT 10;
Both should produce the same result. Which one you use will depend on your particular use case, or perhaps personal preference.
Hope that helps.