Filter count(*) over certain number? - hive

If I try to run either of the queries below then I get the message:
Error while compiling statement: FAILED: ParseException line 5:0
missing EOF at 'where' near 'nino_dtkn'
Which suggests to me that I can't use the newly created count variable in the same query.
Is my conclusion correct?
What can I do to fix it?
I don't want to create a new table - I want to use this as a subquery to merge onto a second table.
select count(*) as cnt,
[variable 1]
from [source table]
group by [variable 1]
where count(*) >= 20;
select count(*) as cnt,
[variable 1]
from [source table]
group by [variable 1]
where cnt >= 20;

Use HAVING clause
select count(*) as cnt,[variable 1]
from [source table]
group by [variable 1]
having count(*) >= 20;

I am not sure about your expected results . WHERE CLAUSE should always come before GROUP BY FUNCTION .
So your query can be rewritten as the one mentioned below :
select count(*) as cnt,[variable 1]
from [source table]
where count(*) >= 20
group by [variable 1]
;

Related

Hive - How to display a warning message when the table has zero results

I need to create a query to print a warning message when the table is empty, but if it is empty, how can it print anything?
HIVE.hql:
select
x,y,count(*)
from
table1
group by
x,y
having
count(*)=0
Well, you can do:
select 'Oops! No rows!
from (select count(*) as cnt
from t
) t
where cnt = 0;
You can also do:
select 'Oops! No rows'
from t
having count(*) = 0;
However, I find having with no group by to be awkward.

Hive - select rows within 1 year of earliest date

I am trying to select all rows in a table that are within 1 year of the earliest date in the table. I'm using the following code:
select *
from baskets a
where activitydate < (select date_add((select min(activitydate) mindate_a from baskets), 365) date_b from baskets)
limit 10;
but get the following error message:
Error while compiling statement: FAILED: ParseException line 1:55 cannot recognize input near 'select' 'date_add' '(' in expression specification
Total execution time: 00:00:00.338
Any suggestions?
EDIT:
With this code:
select *
from baskets a
where activitydate < (select date_add(min(activitydate), 365) from baskets)
limit 10;
I'm getting this error:
Error while compiling statement: FAILED: ParseException line 1:55 cannot recognize input near 'select' 'date_add' '(' in expression specification
I'd be tempted to use window functions:
select b.*
from (select b.*, min(activity_date) as min_ad
from baskets b
) b
where activity_date < add_months(min_ad, 12);
If you really want your syntax to work, try reducing the number of selects:
where activitydate < (select date_add(min(activitydate), 365) from baskets)
Use JOINs instead of select in Sub-query. I don't think Hive supports select in where clause with < condition. Only IN and EXISTS could be used as of Hive 0.13.
: Language Manual SubQueries
SELECT a.*
FROM baskets a
JOIN (SELECT DATE_ADD(MIN(b.activitydate), 365) maxdate
FROM baskets) b
ON a.activitydate < b.maxdate
LIMIT 10;

Group By & Having vs. SubQuery (Where Count is Greater Than 1)

I'm struggling here trying to write a script that finds where an order was returned multiple times by the same associate (count greater than 1). I'm guessing my syntax with the subquery is incorrect. When I run the script, I get a message back that the "SELECT failed.. [3669] More than one value was returned by the subquery."
I'm not tied to the subquery, and have tried using just the group by and having statements, but I get an error regarding a non-aggregate value. What's the best way to proceed here and how do I fix this?
Thank you in advance - code below:
SEL s.saletran
, s.saletran_dt SALE_DATE
, r.saletran_id RET_TRAN
, r.saletran_dt RET_DATE
, ra.user_id RET_ASSOC
FROM salestrans s
JOIN salestrans_refund r
ON r.orig_saletran_id = s.saletran_id
AND r.orig_saletran_dt = s.saletran_dt
AND r.orig_loc_id = s.loc_id
AND r.saletran_dt between s.saletran_dt and s.saletran_dt + 30
JOIN saletran rt
ON rt.saletran_id = r.saletran_id
AND rt.saletran_dt = r.saletran_dt
AND rt.loc_id = r.loc_id
JOIN assoc ra --Return Associate
ON ra.assoc_prty_id = rt.sls_assoc_prty_id
WHERE
(SELECT count(*)
FROM saletran_refund
GROUP BY ORIG_SLTRN_ID
) > 1
AND s.saletran_dt between '2015-01-01' and current_date - 1
Based on what you've got so far, I think you want to use this instead:
where r.ORIG_SLTRN_ID in
(select
ORIG_SLTRN_ID
from
saletran_refund
group by ORIG_SLTRN_ID
having count (*) > 1)
That will give you the ORIG_SLTRN_IDs that have more than one row.
you don't give enough for a full answer but this is a start
group by s.saletran
, s.saletran_dt SALE_DATE
, r.saletran_id RET_TRAN
, r.saletran_dt RET_DATE
, ra.user_id RET_ASSOC
having count(distinct(ORIG_SLTRN_ID)) > 0
this does return more the an one row
run it
SELECT count(*)
FROM saletran_refund
GROUP BY ORIG_SLTRN_ID

HIVE: Error in GROUP BY Key

hive -e "select a.EMP_ID,
count(distinct c.SERIAL_NBR) as NUM_CURRENT_EMP,
count(distinct c.SERIAL_NBR)/count(distinct a.SERIAL_NBR) as DISTINCT_EMP
from ORDERS_COMBINED_EMPLOYEES as a
inner join ORDERS_EMPLOYEE_STATS as b
on a.CPP_ID = b.CPP_ID
left join ( select SERIAL_NBR, MIN(TRAN_DT) as TRAN_DT
from EMP_TXNS
group by SERIAL_NBR
) c
on c.SERIAL_NBR = a.SERIAL_NBR
where c.TRAN_DT > a.LAST_TXN_DT
group by a.EMP_ID
having (
(NUM_CURRENT_EMP >= 25 and DISTINCT_EMP > 0.01)
) ; " > EMPLOYEE_ORDERS.txt
Getting error message,
"FAILED: SemanticException [Error 10025]: Line 15:31 Expression not in GROUP BY key '0.01'".
When I ran the same query with just one condition in HAVING clause as NUM_CURRENT_EMP >= 25, the query ran fine without any issues. NUM_CURRENT_EMP is a int type and DISTINCT_EMP is float in the table where I am trying to insert the results. Breaking my head.
Any help is appreciated.
What happens if you replace the aliases in the having with the expressions that define them?
having count(distinct c.SERIAL_NBR) >= 25 and
count(distinct c.SERIAL_NBR)/count(distinct a.SERIAL_NBR) > 0.01

Determine existence of results in jet SQL?

In Jet, I want to test if certain conditions return any results.
I want a query which returns exactly one record: "true" if there are any results, "false" otherwise.
This works in MS SQL:
SELECT
CASE
WHEN EXISTS(SELECT * FROM foo WHERE <some condition>)
THEN 1
ELSE 0
END;
This is what I have tried in Jet:
SELECT IIF(EXISTS(SELECT * FROM foo WHERE <some condition>), 0, 1);
which gives me the error:
Reserved error (-3025); there is no message for this error.
Any ideas?
NOTE
I don't want to select "true" multiple times by tacking on a FROM clause at the end, because it could be slow (if the FROM table had many records) or undefined (if the table had 0 records).
How about:
SELECT TOP 1 IIF(EXISTS(
SELECT * FROM foo
WHERE <some condition>), 0, 1) As f1
FROM foo
Perhaps more clearly:
SELECT TOP 1 IIF(EXISTS(
SELECT * FROM foo
WHERE <some condition>), 0, 1) As F1
FROM MSysObjects
You might be able to use a count
SELECT DISTINCT IIF((SELECT COUNT(*) AS Result FROM [Data Set]), 1, 0) FROM [Data Set];
Most EXISTS queries can be re-written as a left join:
SELECT
CASE
WHEN foo.col is NULL
THEN 0
ELSE 1
END;
... LEFT JOIN foo on <where condition>