How to do GROUP BY in Hive

I'm writing a Hive query in which I need to group by a few fields, but I also need to select some other fields besides those in the GROUP BY clause. That is,
select A,B,C from table_name GROUP BY A,B
Hive complains with "Invalid table alias or column reference C". It wants me to put C in the GROUP BY clause, but that changes my logic. How can I resolve this issue?

Any selected column that is not listed in GROUP BY must be wrapped in an aggregate function such as COUNT or SUM,
so column C needs an aggregate calculation applied to it.
Example:
select A,B,count(C) as Total_C from table_name GROUP BY A,B;
select A,B,SUM(C) as Total_C from table_name GROUP BY A,B;
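To see what those aggregates return, here is a minimal sketch that runs the same kind of query through Python's built-in sqlite3 module. SQLite is only a stand-in for Hive here, and the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (A TEXT, B TEXT, C INTEGER)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?)",
    [("x", "p", 1), ("x", "p", 2), ("y", "q", 3)],
)

# Every selected column is either in GROUP BY or wrapped in an aggregate.
rows = conn.execute(
    """
    SELECT A, B, COUNT(C) AS count_c, SUM(C) AS total_c, MAX(C) AS max_c
    FROM table_name
    GROUP BY A, B
    ORDER BY A, B
    """
).fetchall()

for row in rows:
    print(row)
```

Each (A, B) pair comes back as one row, so which aggregate you pick decides which value of C survives.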

You have to join back to the table after the group by.
select T1.*, T2.c from (select a, b, count(*) as SomeAggFunc from table_name group by a, b) T1
join table_name T2 on T1.a = T2.a and T1.b = T2.b;
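As a rough illustration of the join-back approach, the sketch below aggregates per (a, b) in a subquery and then joins to the original rows to recover c. It uses Python's sqlite3 as a stand-in for Hive; the data and the group_size alias are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; table contents are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (a TEXT, b TEXT, c INTEGER)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?)",
    [("x", "p", 1), ("x", "p", 2), ("y", "q", 3)],
)

# Aggregate per (a, b) first, then join back so every original row keeps its c.
rows = conn.execute(
    """
    SELECT t1.a, t1.b, t1.group_size, t2.c
    FROM (SELECT a, b, COUNT(*) AS group_size
          FROM table_name
          GROUP BY a, b) t1
    JOIN table_name t2 ON t1.a = t2.a AND t1.b = t2.b
    ORDER BY t1.a, t1.b, t2.c
    """
).fetchall()

for row in rows:
    print(row)
```

Note that this returns one row per original row, each annotated with its group's aggregate, rather than one row per group.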

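You can try using CLUSTER BY instead of GROUP BY. Note that CLUSTER BY only distributes and sorts the rows by A and B; it does not collapse them into one row per group.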
select A,B,C from table_name CLUSTER BY A,B

Related

Aggregate column from CTE cannot be used in WHERE clause in the query in PostgreSQL

My query follows this structure:
WITH CTE AS (
SELECT t1.x, COUNT(t1.y) AS count
FROM table1 t1
GROUP BY t1.x
)
SELECT CTE.x, CTE.count AS newCount, t2.count AS oldCount
FROM table2 t2 JOIN CTE ON t2.x = CTE.x
WHERE t2.count != CTE.count;
I get the following error: [42803] ERROR: aggregate functions are not allowed in WHERE
It looks like CTE.count is the aggregate that triggers this error. Aren't CTEs supposed to be calculated before the main query? How can I rewrite the query to avoid this?
PostgreSQL 13.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
The t2.count is being interpreted as an aggregate COUNT() function, and your t2 table does not have a column called count.
Make sure that your table actually has a count column, or compute its aggregate count in another CTE before joining and then compare the results. Also avoid using the alias "count", as in the following:
WITH CTE1 AS (
SELECT t1.x, COUNT(t1.y) AS total
FROM table1 t1
GROUP BY t1.x
),
CTE2 AS (
SELECT t2.x, COUNT(t2.y) AS total
FROM table2 t2
GROUP BY t2.x
)
SELECT
CTE1.x,
CTE1.total AS newCount,
CTE2.total AS oldCount
FROM
CTE2
JOIN CTE1 ON CTE2.x = CTE1.x
WHERE
CTE2.total != CTE1.total;
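For what it's worth, the two-CTE comparison can be tried out with the sketch below. It runs on Python's sqlite3 rather than PostgreSQL, so it does not reproduce the original error; it only shows the rewritten query returning the groups whose counts differ. The sample rows are made up.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (x INTEGER, y INTEGER)")
conn.execute("CREATE TABLE table2 (x INTEGER, y INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, 10), (1, 11), (2, 20), (3, 30)])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, 100), (2, 200), (2, 201), (3, 300)])

# Each table is counted in its own CTE, under an alias that is not a function name.
rows = conn.execute(
    """
    WITH cte1 AS (
        SELECT x, COUNT(y) AS total FROM table1 GROUP BY x
    ),
    cte2 AS (
        SELECT x, COUNT(y) AS total FROM table2 GROUP BY x
    )
    SELECT cte1.x, cte1.total AS new_count, cte2.total AS old_count
    FROM cte2
    JOIN cte1 ON cte2.x = cte1.x
    WHERE cte2.total != cte1.total
    ORDER BY cte1.x
    """
).fetchall()

for row in rows:
    print(row)
```

Only x = 1 and x = 2 are returned, because x = 3 has the same count in both tables.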
Looks like it is the "t2.count" that causes the issue.
On dbfiddle I can reproduce the issue ONLY when there is no column named "count" in table2.
In other words, the error occurs only when table2 is defined like this:
create table table2 (x int, y int);
However, if I add the "count" column, the error is gone:
create table table2 (x int, y int, count int);
I believe that when there is no such column, Postgres treats "count" as an aggregate function and throws the error.
So my solution would be to check that such a column is present and to never use reserved keywords as column names.

How to create a select clause using a subquery

I have the following sql statement:
WITH
subquery AS (
select distinct id from a_table where some_field in (1,2)
)
select id from another_table where id in subquery;
Edit
JOIN is not an option (this is just a reduced example of a bigger query)
But that obviously does not work. The id field exists in both tables (with a different name, but values are the same: numeric ids). Basically what I want to do is filter by the result of the subquery, like a kind of intersection.
Any idea how to write that query in a correct way?
You need a subquery for the second operand of IN that SELECTs from the CTE.
... IN (SELECT id FROM subquery) ...
But I would recommend rewriting it as a JOIN.
Are you able to join on id and then filter in the WHERE clause?
select a.id
from another_table a
inner join a_table b on a.id = b.id
where b.some_field in (1,2)
Since you only want the id from another_table, you can use EXISTS:
with s as (
select id
from a_table
where some_field in (1,2)
)
select id
from another_table t
where exists ( select * from s where s.id=t.id )
But the CTE is really redundant, since all you are doing is:
select id
from another_table t
where exists (
select * from a_table a where a.id=t.id and a.some_field in (1,2)
)
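A quick way to check that the IN (SELECT ... FROM cte) form and the EXISTS form agree is the sketch below, using Python's sqlite3. The database engine was not stated in the question, so SQLite is an assumption, and the rows are invented.

```python
import sqlite3

# Assumption: SQLite as the engine; table contents are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a_table (id INTEGER, some_field INTEGER)")
conn.execute("CREATE TABLE another_table (id INTEGER)")
conn.executemany("INSERT INTO a_table VALUES (?, ?)", [(1, 1), (2, 2), (3, 5), (1, 2)])
conn.executemany("INSERT INTO another_table VALUES (?)", [(1,), (2,), (3,), (4,)])

# Form 1: the CTE must be selected from inside the IN (...) operand.
in_rows = conn.execute(
    """
    WITH subquery AS (
        SELECT DISTINCT id FROM a_table WHERE some_field IN (1, 2)
    )
    SELECT id FROM another_table
    WHERE id IN (SELECT id FROM subquery)
    ORDER BY id
    """
).fetchall()

# Form 2: a correlated EXISTS without any CTE.
exists_rows = conn.execute(
    """
    SELECT id FROM another_table t
    WHERE EXISTS (
        SELECT 1 FROM a_table a
        WHERE a.id = t.id AND a.some_field IN (1, 2)
    )
    ORDER BY id
    """
).fetchall()

print(in_rows, exists_rows)
```

Both forms filter another_table without a JOIN and return the same ids on this data.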

Table.* notation does not work in a 'group by' query

On an Oracle database, the Table.* notation does not work inside a 'select..group by..' query.
This query with no * works:
select A.id from TABLE_A A INNER JOIN TABLE_B B on A.id=B.aid group by A.id
This one with a * does not:
select A.* from TABLE_A A INNER JOIN TABLE_B B on A.id=B.aid group by A.id
The output is
00979. 00000 - "not a GROUP BY expression"
Why does this query not work? Is there a simple workaround?
Everything you select, except aggregate functions (MIN, MAX, SUM, AVG, COUNT...), must be in the GROUP BY.
Yes, there is a workaround.
Assuming that each id in A is unique, then you don't even need to use group by, just:
select * from A
where id in (
select id from b
);
If ids are not unique in table A, then you can simulate the MySQL functionality with this query:
select * from A
where rowid in (
select min( a.rowid )
from a
join b on a.id = b.id
group by a.id
);
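The idea of keeping one physical row per id can be sketched as follows. This runs on Python's sqlite3, whose implicit rowid column plays the role of Oracle's ROWID here; that substitution, and the sample rows, are assumptions made for illustration.

```python
import sqlite3

# Assumption: SQLite's implicit rowid stands in for Oracle's ROWID pseudocolumn.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE b (id INTEGER)")
conn.executemany(
    "INSERT INTO a VALUES (?, ?)",
    [(1, "first"), (1, "dup"), (2, "two"), (3, "three")],
)
conn.executemany("INSERT INTO b VALUES (?)", [(1,), (2,), (2,)])

# For each id that also appears in b, keep only the row with the smallest rowid.
rows = conn.execute(
    """
    SELECT * FROM a
    WHERE rowid IN (
        SELECT MIN(a.rowid)
        FROM a
        JOIN b ON a.id = b.id
        GROUP BY a.id
    )
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```

The duplicate row for id 1 and the unmatched id 3 are both dropped, and all columns of the surviving rows are available without listing them in GROUP BY.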
Here is a link to MySql documentation where their extension to group by is explained: http://dev.mysql.com/doc/refman/5.1/en/group-by-extensions.html
Pay attention to this fragment:
You can use this feature to get better performance by avoiding
unnecessary column sorting and grouping. However, this is useful
primarily when all values in each nonaggregated column not named in
the GROUP BY are the same for each group. The server is free to choose
any value from each group, so unless they are the same, the values
chosen are indeterminate. Furthermore, the selection of values from
each group cannot be influenced by adding an ORDER BY clause. Sorting
of the result set occurs after values have been chosen, and ORDER BY
does not affect which values within each group the server chooses.
A GROUP BY expression must include every non-aggregated column you select. So, if the table has 3 columns (column1, column2 and column3), you have to group by all of them like this: group by Column1, Column2, Column3. The * means you select all the columns, so add all of them to the GROUP BY expression.

oracle intersect doesn't work

In Oracle SQL, why doesn't this code compile? Does Oracle not support INTERSECT? Does INTERSECT only take one column?
Assume the two tables have the same column types.
Thanks
select B.name, B.id from tmp_B B where B.id in (select distinct id from tmp_A);
intersect
select distinct A.name, A.id from tmp_A A;
error message
Error report:
Unknown Command
There is a syntax error in your statement. You have an extra semicolon after the initial SELECT and before the INTERSECT.
select B.name, B.id from tmp_B B where B.id in (select distinct id from tmp_A)
intersect
select distinct A.name, A.id from tmp_A A
should be a valid SQL statement assuming that ID and NAME have the same data types in both tables.
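To illustrate that INTERSECT works across several columns once the stray semicolon is gone, here is a small sketch using Python's sqlite3 in place of Oracle. The rows are invented sample data.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle; table contents are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp_A (name TEXT, id INTEGER)")
conn.execute("CREATE TABLE tmp_B (name TEXT, id INTEGER)")
conn.executemany("INSERT INTO tmp_A VALUES (?, ?)", [("ann", 1), ("bob", 2)])
conn.executemany("INSERT INTO tmp_B VALUES (?, ?)", [("ann", 1), ("bob", 3), ("cat", 2)])

# One statement: no semicolon between the first SELECT and INTERSECT.
rows = conn.execute(
    """
    SELECT B.name, B.id FROM tmp_B B
    WHERE B.id IN (SELECT DISTINCT id FROM tmp_A)
    INTERSECT
    SELECT DISTINCT A.name, A.id FROM tmp_A A
    """
).fetchall()

print(rows)
```

INTERSECT compares whole rows, so only the (name, id) pair present in both result sets is returned.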

How can I reference a single table multiple times in the same query?

Sometimes I need to treat the same table as two separate tables. What is the solution?
You can reference the same table more than once; just be sure to give each reference its own table alias:
select a.EmployeeName,b.EmployeeName as Manager
from Employees A
join Employees B on a.Mgr_id=B.Id
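A runnable version of that self-join might look like the sketch below, using Python's sqlite3. The Employees schema is inferred from the query above, and the rows are made up for illustration.

```python
import sqlite3

# Assumption: the Employees columns are inferred from the query; rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Id INTEGER, EmployeeName TEXT, Mgr_id INTEGER)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [(1, "Alice", None), (2, "Bob", 1), (3, "Carol", 1)],
)

# Alias a is the employee's row, alias b is the same table read as the manager's row.
rows = conn.execute(
    """
    SELECT a.EmployeeName, b.EmployeeName AS Manager
    FROM Employees a
    JOIN Employees b ON a.Mgr_id = b.Id
    ORDER BY a.Id
    """
).fetchall()

for row in rows:
    print(row)
```

Alice has no manager, so the inner join leaves her out; a LEFT JOIN would keep her with an empty Manager.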
Use an alias like a variable name in your SQL:
select
A.Id,
A.Name,
B.Id as SpouseId,
B.Name as SpouseName
from
People A
join People B on A.Spouse = B.id
Use an alias:
SELECT t1.col1, t2.col3
FROM tbl t1
INNER JOIN tbl t2
ON t1.col1 = t2.col2
An alias is the most obvious solution:
SELECT * FROM x1 AS x, x1 AS y
However, if the table is the result of a query, a common table expression is quite useful:
;WITH ctx AS
( select * from z)
SELECT c1.* FROM ctx AS c1, ctx AS c2
A third solution, suitable when your query takes a long time to run, is temporary tables:
SELECT *
INTO #monkey
FROM chimpanzee
SELECT * FROM #monkey m1,#monkey m2
DROP TABLE #MONKEY
Note that a common table expression is available only to the single statement that directly follows it, whereas a temporary table persists across statements until it is dropped or the session ends.