Distinct count and group by in Hive

I am very new to Hive and have an issue with distinct count and GROUP BY.
I want to calculate the maximum temperature from the temperature_data table for each year that has at least 2 entries in the table.
I tried the query below, but it does not work:
select
SUBSTRING(full_date,7,4) as year,
MAX(temperature) as temperature
from temperature_data
where count(distinct(SUBSTRING(full_date,7,4))) >= 2
GROUP BY SUBSTRING(full_date,7,4);
I am getting this error:
FAILED: SemanticException [Error 10128]: Line 2:0 Not yet supported place for UDAF 'count'
Below is the input:
year,zip,temperature
10-01-1990,123112,10
14-02-1991,283901,11
10-03-1990,381920,15
10-01-1991,302918,22
12-02-1990,384902,9
10-01-1991,123112,11
14-02-1990,283901,12
10-03-1991,381920,16
10-01-1990,302918,23
12-02-1991,384902,10
10-01-1993,123112,11

Use the HAVING clause instead of WHERE to put a condition on an aggregate computed over each group.
A subquery also helps here, because the year is extracted once and can then be reused by name. See below.
SELECT
year,
MAX(t1.temperature) as temperature
FROM
(select SUBSTRING(full_date,7,4) year, temperature from temperature_data) t1
GROUP BY
year
HAVING
count(t1.year) >= 2;

@R.Gold, we can simplify the above query and drop the subquery:
SELECT substring(full_date,7) as year, max(temperature)
FROM temperature_data
GROUP BY substring(full_date,7)
HAVING COUNT(substring(full_date,7)) >= 2
Also note that aggregate functions cannot be used in a WHERE clause: WHERE filters individual rows before grouping, whereas HAVING filters the groups afterwards.
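To see the WHERE versus HAVING difference on the question's own data, here is a minimal sketch in Python using the standard sqlite3 module. SQLite is only a stand-in for Hive (an assumption, chosen so the example runs anywhere): substr plays the role of SUBSTRING, and the error raised will not match Hive's message.

```python
import sqlite3

# Sample rows from the question: (full_date as DD-MM-YYYY, zip, temperature).
rows = [
    ("10-01-1990", 123112, 10),
    ("14-02-1991", 283901, 11),
    ("10-03-1990", 381920, 15),
    ("10-01-1991", 302918, 22),
    ("12-02-1990", 384902, 9),
    ("10-01-1991", 123112, 11),
    ("14-02-1990", 283901, 12),
    ("10-03-1991", 381920, 16),
    ("10-01-1990", 302918, 23),
    ("12-02-1991", 384902, 10),
    ("10-01-1993", 123112, 11),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE temperature_data (full_date TEXT, zip INTEGER, temperature INTEGER)"
)
conn.executemany("INSERT INTO temperature_data VALUES (?, ?, ?)", rows)

# An aggregate in WHERE is rejected: WHERE is evaluated per row, before grouping.
try:
    conn.execute(
        "SELECT substr(full_date, 7, 4), MAX(temperature) FROM temperature_data "
        "WHERE COUNT(*) >= 2 GROUP BY substr(full_date, 7, 4)"
    )
    where_rejected = False
except sqlite3.Error:
    where_rejected = True

# HAVING is evaluated per group, after grouping, so the aggregate is allowed.
result = conn.execute(
    """
    SELECT substr(full_date, 7, 4) AS year, MAX(temperature) AS temperature
    FROM temperature_data
    GROUP BY substr(full_date, 7, 4)
    HAVING COUNT(*) >= 2
    ORDER BY year
    """
).fetchall()

print(where_rejected)
print(result)
```

With the sample input, 1990 and 1991 have five rows each and are kept, while 1993 has a single row and is dropped by HAVING.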

Related

SQL error code in Athena Your query has the following error(s): SYNTAX_ERROR: line 5:8: Column 'amount' cannot be resolved

In AWS Athena I have the SQL query as follows:
select licence, count(distinct (id)) as amount
from "database_name"
where YEAR(column_year) = 2021
group by licence
having amount > 10
order by amount desc
Then I get the error:
SYNTAX_ERROR: line 5:8: Column 'amount' cannot be resolved. This query ran against the "database_name", unless qualified by the query.
What am I doing wrong?
Two things:
You cannot use a select alias in the HAVING clause here, so you have to repeat the exact aggregate expression.
DISTINCT is not a function, so you can use it without parentheses.
select licence, count(distinct id) as amount
from "database_name"
where YEAR(column_year) = 2021
group by licence
having count(distinct id) > 10
order by amount desc
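As a rough illustration of the corrected form, the sketch below runs the same shape of query through Python's sqlite3 module. The table name licences and its rows are made up for the example, and SQLite is only a stand-in for Athena, so it does not try to reproduce the Athena error. It only shows the portable pattern of repeating the aggregate in HAVING, and how COUNT(DISTINCT id) differs from COUNT(id).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data, invented for this example.
conn.execute("CREATE TABLE licences (licence TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO licences VALUES (?, ?)",
    [("A", 1), ("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 4), ("B", 5)],
)

# Repeat the aggregate expression in HAVING instead of referring to the alias.
result = conn.execute(
    """
    SELECT licence, COUNT(DISTINCT id) AS amount
    FROM licences
    GROUP BY licence
    HAVING COUNT(DISTINCT id) > 2
    ORDER BY amount DESC
    """
).fetchall()

# For comparison: counting without DISTINCT also counts the repeated ids.
plain = conn.execute(
    "SELECT licence, COUNT(id) FROM licences GROUP BY licence ORDER BY licence"
).fetchall()

print(result)
print(plain)
```

Licence B has three rows but only two distinct ids, so it passes a plain COUNT(id) > 2 filter and fails the DISTINCT one.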

SQL Query not working

I seem to be getting this error while trying to run the below query:
SELECT
to_char(EFFECTIVE_DT,'YYYY-MM') as YYYYMM,
--EFFECTIVE_DT,
AH01_PAYMENT_STATUS_CTD,
TSYS_ACCT_ID
FROM OIS_TSYS.AH_CYCLE_HIST
WHERE 1=1
AND EFFECTIVE_DT BETWEEN '01-MAY-2017' AND '31-MAY-2017'
GROUP BY 2
ORDER BY 1
error: ORA-00979: not a GROUP BY expression
I am trying to group by date, as at the moment I get the results daily for each individual account.
Result set:
65589 N 03-MAY-17
65590 S 03-MAY-17
65591 M 03-MAY-17
65592 F 03-MAY-17
65617 G 03-MAY-17
Any help would be amazing.
Best,
Saad
When you use GROUP BY, every selected column that is not part of the GROUP BY clause must be wrapped in an aggregate function (SUM, AVG, MIN, MAX, ...).
The "1=1" condition is redundant and can be dropped.
If the select list contains more than one column outside of aggregate functions such as SUM, COUNT, MIN or MAX, you cannot put just one of them in the GROUP BY clause. In your case, all three selected columns have to appear in the GROUP BY.
To get the desired result, use the query below:
SELECT
TSYS_ACCT_ID,
AH01_PAYMENT_STATUS_CTD,
to_char(EFFECTIVE_DT,'YYYY-MM') as YYYYMM
FROM OIS_TSYS.AH_CYCLE_HIST
WHERE EFFECTIVE_DT BETWEEN '01-MAY-2017' AND '31-MAY-2017'
GROUP BY
TSYS_ACCT_ID,
AH01_PAYMENT_STATUS_CTD,
to_char(EFFECTIVE_DT,'YYYY-MM')
ORDER BY 1

group by date part of datetime and get number of records for each

I have this so far:
select created_at,
DATEDIFF(TO_DATE(current_date()), TO_DATE(sales_flat_order.created_at)) as delay,
count(*) over() as NumberOfOrders
FROM
magentodb.sales_flat_order
WHERE
status IN ( 'packed' , 'cod_confirmed' )
GROUP BY TO_DATE(created_at)
But this is not working.
syntax error:
Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'created_at': (possible column names are: (tok_function to_date (tok_table_or_col created_at)))
count(*) does not give the count for each date group but the count of all rows instead.
Note: I am actually using Hive, but it is exactly like SQL when it comes to queries.
Try this:
select created_at,
DATEDIFF(TO_DATE(current_date()), TO_DATE(sales_flat_order.created_at)) as delay,
count(*) as NumberOfOrders
FROM
magentodb.sales_flat_order
WHERE
status IN ( 'packed' , 'cod_confirmed' )
GROUP BY Date(created_at)
I think you want to group by the date part (year, month and day) of created_at.
select
date(created_at) as created_at_day,
datediff(curdate(), sales_flat_order.created_at) as delay,
count(*) as numberOfOrders
from magentodb.sales_flat_order
WHERE status IN ('packed', 'cod_confirmed' ) GROUP BY created_at_day
This query will show only the first order created on each day, because you are grouping by the day. You can use AVG to find the average delay of the orders created on that day.
I can't post comments from my phone, but this link might point you in the right direction:
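Here is a minimal sketch of grouping by the date part and counting orders per day, run through Python's sqlite3 module as a stand-in for Hive or MySQL. The rows and the fixed reference date TODAY are invented for the example (a fixed date keeps the output stable), and SQLite's date() and julianday() replace TO_DATE and DATEDIFF.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows; only the columns used by the query are modelled.
conn.execute("CREATE TABLE sales_flat_order (created_at TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO sales_flat_order VALUES (?, ?)",
    [
        ("2015-04-01 09:15:00", "packed"),
        ("2015-04-01 17:40:00", "cod_confirmed"),
        ("2015-04-02 08:05:00", "packed"),
        ("2015-04-02 11:30:00", "shipped"),  # filtered out by the WHERE clause
    ],
)

TODAY = "2015-04-10"  # assumed reference date instead of the current date

# Select the same expression that is grouped on, not the raw timestamp,
# and use a plain COUNT(*) so that it counts rows per group.
result = conn.execute(
    """
    SELECT date(created_at) AS created_day,
           CAST(julianday(?) - julianday(date(created_at)) AS INTEGER) AS delay,
           COUNT(*) AS number_of_orders
    FROM sales_flat_order
    WHERE status IN ('packed', 'cod_confirmed')
    GROUP BY date(created_at)
    ORDER BY created_day
    """,
    (TODAY,),
).fetchall()

print(result)
```

Because the delay is computed from the day rather than from the full timestamp, it is the same for every order in a group, so there is no ambiguity about which order it belongs to.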
stackoverflow.com/questions/29704904/invalid-table-alias-or-column-reference-b

Not a group by function at a cumulative query

I'm making a cumulative query that shows the evolution of clients in my database. To build this query, I use the year and the week of the year in which they joined the client database.
I have the following query to search for relevant data:
SELECT DD.CAL_YEAR, DD.WEEK_OF_YEAR, SUM(COUNT(DISTINCT FAB.ID)) OVER ( ORDER BY DD.CAL_DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS "Number of account statements"
FROM CLIENT_DATABASE FAB
JOIN DIM_DATE DD ON FAB.BALANCE_DATE_ID = DD.ID
GROUP BY DD.CAL_YEAR, DD.WEEK_OF_YEAR;
But when I compile this query, I get the following error:
Error: ORA-00979: not a GROUP BY expression
SQLState: 42000 ErrorCode: 979
How can I fix this?
Since you are grouping by DD.CAL_YEAR, DD.WEEK_OF_YEAR, you can't use DD.CAL_DATE in the order by clause of your cumulative sum function.
It's hard for me to say exactly what you are trying to do without fully understanding your data. But, logically, it does seem like you should be able to simply use DD.CAL_YEAR, DD.WEEK_OF_YEAR in the order by clause instead of DD.CAL_DATE, and still get the results the way you are expecting.
So something like this:
SUM(COUNT(DISTINCT FAB.ID)) OVER ( ORDER BY DD.CAL_YEAR, DD.WEEK_OF_YEAR ...
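The idea of ordering the running total by the grouped columns can be sketched with Python's sqlite3 module. This is a stand-in for Oracle and needs SQLite 3.25 or later for window functions. The table contents are invented, and the weekly count is computed in a subquery before the window is applied, which is a slightly different shape from the nested SUM(COUNT(...)) above but follows the same logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature versions of the two tables from the question.
conn.execute("CREATE TABLE dim_date (id INTEGER, cal_year INTEGER, week_of_year INTEGER)")
conn.execute("CREATE TABLE client_database (id INTEGER, balance_date_id INTEGER)")
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(1, 2016, 51), (2, 2016, 51), (3, 2016, 52), (4, 2017, 1)],
)
conn.executemany(
    "INSERT INTO client_database VALUES (?, ?)",
    [(100, 1), (101, 2), (102, 3), (103, 3), (104, 3), (105, 4)],
)

# The inner query groups by year and week; the window then orders by those
# same grouped columns, never by a column that was grouped away.
result = conn.execute(
    """
    SELECT cal_year, week_of_year,
           SUM(weekly_count) OVER (
               ORDER BY cal_year, week_of_year
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM (
        SELECT dd.cal_year, dd.week_of_year, COUNT(DISTINCT fab.id) AS weekly_count
        FROM client_database fab
        JOIN dim_date dd ON fab.balance_date_id = dd.id
        GROUP BY dd.cal_year, dd.week_of_year
    )
    ORDER BY cal_year, week_of_year
    """
).fetchall()

print(result)
```

The weekly counts here are 2, 3 and 1, so the running totals come out as 2, 5 and 6.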

Using a timestamp function in a GROUP BY

I'm working with a large transaction data set and would like to group a count of individual customer transactions by month. I am unable to use the timestamp function in the GROUP BY and get the following error:
BAD_QUERY (expression STRFTIME_UTC_USEC([DATESTART], '%b') in GROUP BY is invalid)
Is there a simple workaround to achieve this or should I build a calendar table (which may be the simplest option)?
You have to use an alias:
SELECT STRFTIME_UTC_USEC(DATESTART, '%b') as month, COUNT(TRANSACTION)
FROM datasetId.tableId
GROUP BY month
@Charles is correct, but as an aside you can also group by column number.
SELECT STRFTIME_UTC_USEC(DATESTART, '%b') as month, COUNT(TRANSACTION) as count
FROM [datasetId.tableId]
GROUP BY 1
ORDER BY 2 DESC
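Both variants, grouping by the alias and grouping by column position, can be tried out with Python's sqlite3 module. SQLite is only a stand-in for BigQuery here, strftime('%m', ...) replaces STRFTIME_UTC_USEC, and the table customer_transactions and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with one row per transaction.
conn.execute("CREATE TABLE customer_transactions (datestart TEXT)")
conn.executemany(
    "INSERT INTO customer_transactions VALUES (?)",
    [
        ("2014-01-05",), ("2014-01-20",),
        ("2014-02-11",),
        ("2014-03-02",), ("2014-03-15",), ("2014-03-30",),
    ],
)

# Variant 1: group by the alias given to the month expression.
by_alias = conn.execute(
    """
    SELECT strftime('%m', datestart) AS month, COUNT(*) AS n
    FROM customer_transactions
    GROUP BY month
    ORDER BY n DESC
    """
).fetchall()

# Variant 2: group and order by column position in the select list.
by_position = conn.execute(
    """
    SELECT strftime('%m', datestart) AS month, COUNT(*) AS n
    FROM customer_transactions
    GROUP BY 1
    ORDER BY 2 DESC
    """
).fetchall()

print(by_alias)
print(by_position)
```

Both queries return the same rows. Positional grouping is shorter, but it silently changes meaning if the select list is reordered, so the alias form is usually easier to maintain.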