Subsets count on Hive using CTE - hive

I want to count the rows in a Hive table and, at the same time, count subsets of those rows (based on certain conditions in the WHERE clause) in a single query. I came across CTEs in this post, which I think applies to non-Hive SQL. I've researched a bit and found that Hive supports CTEs as well. However, this form did not work when I tried it in Hive:
WITH MY_TABLE AS (
SELECT *
FROM orig_table
WHERE base_condition
)
SELECT
(SELECT COUNT(*) FROM MY_TABLE) AS total,
(SELECT COUNT(*) FROM MY_TABLE WHERE cond_1) AS subset_1,
...
(SELECT COUNT(*) FROM MY_TABLE WHERE cond_n) AS subset_n;
Does anyone have a workaround or similar working idea for Hive?

There is no need for common table expressions. Use CASE WHEN expressions to sum over the conditions:
select count(1) as total
, sum(case when cond_1 then 1 else 0 end) as subset_1
--...
, sum(case when cond_n then 1 else 0 end) as subset_n
from orig_table
where base_condition
;
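
To make the conditional-aggregation pattern above concrete, here is a minimal sketch that runs the same query shape through Python's built-in sqlite3 module. SQLite is only a stand-in for Hive here, and the columns (region, amount, active) and the conditions are hypothetical placeholders for cond_1, cond_n and base_condition.

```python
import sqlite3

def count_subsets(rows):
    """Return (total, subset_1, subset_2) from a single scan of the filtered rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema standing in for orig_table.
    conn.execute("CREATE TABLE orig_table (region TEXT, amount INTEGER, active INTEGER)")
    conn.executemany("INSERT INTO orig_table VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT COUNT(1) AS total,
               SUM(CASE WHEN amount > 100 THEN 1 ELSE 0 END) AS subset_1,  -- cond_1 (assumed)
               SUM(CASE WHEN region = 'EU' THEN 1 ELSE 0 END) AS subset_2  -- cond_n (assumed)
        FROM orig_table
        WHERE active = 1  -- stands in for base_condition
        """
    ).fetchone()
    conn.close()
    return result

sample_rows = [
    ("EU", 150, 1),
    ("EU", 50, 1),
    ("US", 200, 1),
    ("US", 10, 0),  # removed by the base condition
]
print(count_subsets(sample_rows))  # (3, 2, 2)
```

One thing to watch: if no rows pass the base condition, COUNT returns 0 but each SUM returns NULL, so wrap the sums in COALESCE(..., 0) if a zero is required.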

Related

Can I Select DISTINCT on 2 columns and Sum grouped by 1 column in one query?

Is it possible to write one query, where I would group by 2 columns in a table to get the count of total members plus get a sum of one column in that same table, but grouped by one column?
For example, the data looks like this
I want to get a count on distinct combinations of columns "OHID" and "MemID" and get the SUM of the "Amount" column grouped by OHID. The result is supposed to look like this
I was able to get the count correct using this query below
SELECT count(*) as TotCount
from (Select DISTINCT OHID, MemID
from #temp) AS TotMembers
However, when I try to use this query below to get all the results together, I am getting a count of 15 and a totally different total sum.
SELECT t.OHID,
count(TotMembers.MemID) as TotCount,
sum(t.Amount) as TotalAmount
from (Select DISTINCT OHID, MemID
from #temp) AS TotMembers
join #temp t on t.OHID = TotMembers.OHID
GROUP by t.OHID
If I understand correctly, you want to consider NULL as a valid value. The rest is just aggregation:
select t.ohid,
(count(distinct t.memid) +
(case when count(*) <> count(t.memid) then 1 else 0 end)
) as num_memid,
sum(t.amount) as total_amount
from #temp t
group by t.ohid;
The case logic might be a bit off-putting. It is just adding 1 if any values are NULL.
You might find this easier to follow with two levels of aggregation:
select t.ohid, count(*), sum(amount)
from (select t.ohid, t.memid, sum(t.amount) as amount
from #temp t
group by t.ohid, t.memid
) t
group by t.ohid
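
As a rough check that the two queries above agree, the sketch below runs both against a small in-memory SQLite table through Python's sqlite3 module. SQLite stands in for SQL Server, the table name payments replaces #temp, and the sample rows are invented for illustration (one MemID is NULL on purpose).

```python
import sqlite3

# Hypothetical sample data: (ohid, memid, amount).
SAMPLE = [
    (1, "A", 10), (1, "A", 20), (1, "B", 5), (1, None, 7),
    (2, "C", 1), (2, "C", 2),
]

def members_and_totals(rows, two_level=False):
    """Return [(ohid, distinct member count incl. NULL, total amount), ...]."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (ohid INTEGER, memid TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    if two_level:
        # Inner query: one row per (ohid, memid); NULL memid forms its own group.
        sql = """
            SELECT ohid, COUNT(*), SUM(amount)
            FROM (SELECT ohid, memid, SUM(amount) AS amount
                  FROM payments
                  GROUP BY ohid, memid) AS per_member
            GROUP BY ohid
            ORDER BY ohid
        """
    else:
        # COUNT(DISTINCT) skips NULLs, so add 1 when any memid is NULL.
        sql = """
            SELECT ohid,
                   COUNT(DISTINCT memid)
                     + (CASE WHEN COUNT(*) <> COUNT(memid) THEN 1 ELSE 0 END),
                   SUM(amount)
            FROM payments
            GROUP BY ohid
            ORDER BY ohid
        """
    out = conn.execute(sql).fetchall()
    conn.close()
    return out

print(members_and_totals(SAMPLE))
print(members_and_totals(SAMPLE, two_level=True))
```

Both calls should return the same rows, because grouping by (ohid, memid) treats NULL as one more member, which is exactly what the CASE term adds in the single-level version.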

Join Two Count(*) Tables with No Relation in SQL Server

I am trying to combine the results of a count(*) statement and a count(*) with a where clause on a SQL Server Table into a single table.
I have a union statement that brings together the two queries, one on top of the other.
SELECT count(*) FROM [dbo].[asma] a
where [MLR] in ('y')
union
SELECT count(*) as 'Total' FROM [dbo].[asma]
I looked at the solutions in this post, but couldn't piece together one that would present these side by side. How would you do this?
What I need is this output:
You can do conditional aggregation instead :
select sum(case when MLR = 'y' then 1 else 0 end) as Active, count(*) as Total
from dbo.asma a;

In SQL, how do I create new column of values for each distinct values of another column?

Something like this: SQL How to create a value for a new column based on the count of an existing column by groups?
But I have more than two distinct values. I have a variable number n of distinct values, so I don't always know how many different counts I have.
And then in the original table, I want each row '3', '4', etc. to have the count i.e. all the rows with the '3' would have the same count, all the rows with '4' would have the same count, etc.
edit: Also how would I split the count via different dates i.e. '2017-07-19' for each distinct values?
edit2: Here is how I did it, but now I need to split it via different dates.
edit3: This is how I split via dates.
#standardSQL
SELECT * FROM
(SELECT * FROM table1) main
LEFT JOIN (SELECT event_date, value, COUNT(value) AS count
FROM table1
GROUP BY event_date, value) sub ON main.value=sub.value
AND sub.event_date=SAFE_CAST(main.event_time AS DATE)
edit4: I wish PARTITION BY was documented somewhere better. Nothing seems to be widely written on BigQuery or anything with detailed documentation
#standardSQL
SELECT
*,
COUNT(*) OVER (PARTITION BY event_date, value) AS cnt
FROM table1;
The query that you give would better be written using window functions:
SELECT t1.*, COUNT(*) OVER (PARTITION BY value) as cnt
FROM table1 t1;
I am not sure if this answers your question.
If you have another column that you want to count as well, you can use conditional aggregation:
SELECT t1.*,
COUNT(*) OVER (PARTITION BY value) as cnt,
SUM(CASE WHEN datecol = '2017-07-19' THEN 1 ELSE 0 END) OVER (PARTITION BY value) as cnt_20170719
FROM table1 t1;
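
For readers who want to see what the window-function queries return, here is a small sketch using Python's sqlite3 module. It assumes the bundled SQLite is version 3.25 or later (the first release with window functions) and uses SQLite as a stand-in for BigQuery; the event_date and value columns follow the question, while the sample rows are made up.

```python
import sqlite3

# Hypothetical rows: (event_date, value).
EVENTS = [
    ("2017-07-19", 3),
    ("2017-07-19", 3),
    ("2017-07-20", 3),
    ("2017-07-19", 4),
]

def counts_per_row(rows):
    """Attach per-value and per-(date, value) counts to every original row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (event_date TEXT, value INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", rows)
    out = conn.execute(
        """
        SELECT event_date,
               value,
               COUNT(*) OVER (PARTITION BY value) AS cnt,
               SUM(CASE WHEN event_date = '2017-07-19' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY value) AS cnt_20170719,
               COUNT(*) OVER (PARTITION BY event_date, value) AS cnt_by_date
        FROM table1
        ORDER BY value, event_date
        """
    ).fetchall()
    conn.close()
    return out

for row in counts_per_row(EVENTS):
    print(row)
```

Unlike GROUP BY, the window version keeps every input row, so no self-join is needed to carry the counts back onto the original table.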

Returning 0 or 1 for a SQL duplicate query

Working with Teradata if that matters...
What I have is a duplicate check that looks like this:
SELECT ID, COUNT(*)
FROM TBL_A
HAVING COUNT(*) > 1
GROUP BY ID;
If there are no duplicates it returns 0 rows, if there are duplicates it shows what they are. That's fine, but what I want is the return to either be 0 (if no duplicates) or 1 if duplicates are found. That's it.
Any ideas? Thanks!
I think the easiest way is a case:
select (case when count(id) = count(distinct id) then 0 else 1 end)
from tbl_a;
Note: This ignores id when it has a NULL value. If you need to take that into account, it is easy to modify the query.
If there's a multi-column key, you can't use Gordon's approach, as aggregate functions only work on a single column.
A possible workaround would be combining those columns into one like this
COUNT(column1 || column2 || column3)
but it's probably not very efficient.
Otherwise you need to add another COUNT using a Derived Table:
select case when count(*) = 0 then 0 else 1 end
from
(
SELECT column1, column2, column3, COUNT(*)
FROM TBL_A
GROUP BY 1,2,3
HAVING COUNT(*) > 1
) as dt
You should compare resource usage; it should be similar to COUNT(DISTINCT ID) for a single column, but use less CPU for multiple columns.
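
The two duplicate checks above can be tried side by side with the sketch below, which uses Python's sqlite3 module as a stand-in for Teradata. The tbl_a schema (id, column1, column2) and the sample rows are assumptions made for the demonstration.

```python
import sqlite3

def duplicate_flags(rows):
    """Return (flag for duplicate id, flag for duplicate (column1, column2))."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl_a (id INTEGER, column1 TEXT, column2 TEXT)")
    conn.executemany("INSERT INTO tbl_a VALUES (?, ?, ?)", rows)

    # Single-column key: compare the row count with the distinct count.
    single = conn.execute(
        "SELECT CASE WHEN COUNT(id) = COUNT(DISTINCT id) THEN 0 ELSE 1 END FROM tbl_a"
    ).fetchone()[0]

    # Multi-column key: count the groups that occur more than once.
    multi = conn.execute(
        """
        SELECT CASE WHEN COUNT(*) = 0 THEN 0 ELSE 1 END
        FROM (SELECT column1, column2, COUNT(*) AS n
              FROM tbl_a
              GROUP BY column1, column2
              HAVING COUNT(*) > 1) AS dt
        """
    ).fetchone()[0]
    conn.close()
    return single, multi

print(duplicate_flags([(1, "a", "x"), (2, "a", "y"), (3, "b", "x")]))  # (0, 0)
print(duplicate_flags([(1, "a", "x"), (2, "a", "x")]))                 # (0, 1)
```

Both queries always return exactly one row, which is what turns the "zero rows or some rows" result of the original check into a plain 0 or 1.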

unique count of the columns?

i want to get a unique count of multiple columns containing similar or different data...i am using sql server 2005...for one column i am able to take the unique count... but to take a count of multiple columns at a time, what's the query ?
You can run the following select, getting the data from a derived table:
select count(*) from (select distinct c1, c2 from t1) dt
To get the number of rows for each unique combination of column values, use
SELECT COUNT(*) FROM TableName GROUP BY UniqueColumn1, UniqueColumn2
To get the unique counts of multiple individual columns, use
SELECT COUNT(DISTINCT Column1), COUNT(DISTINCT Column2)
FROM TableName
Your question is not clear what exactly you want to achieve.
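
Since the answers above read the question in two different ways, this sketch shows both counts on the same data using Python's sqlite3 module (a stand-in for SQL Server 2005). The table t1 with columns c1 and c2 follows the first answer, and the sample rows are hypothetical.

```python
import sqlite3

def unique_counts(rows):
    """Return (distinct (c1, c2) combinations, (distinct c1, distinct c2))."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (c1 TEXT, c2 INTEGER)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?)", rows)

    # Interpretation 1: how many distinct combinations of the two columns exist.
    combos = conn.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT c1, c2 FROM t1) AS dt"
    ).fetchone()[0]

    # Interpretation 2: distinct values counted per column, independently.
    per_column = conn.execute(
        "SELECT COUNT(DISTINCT c1), COUNT(DISTINCT c2) FROM t1"
    ).fetchone()
    conn.close()
    return combos, per_column

print(unique_counts([("a", 1), ("a", 1), ("a", 2), ("b", 2)]))  # (3, (2, 2))
```

The two numbers differ whenever the columns vary independently, so it is worth deciding which interpretation is wanted before choosing a query.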
I think what you're getting at is individual sums from two unique columns in one query. I was able to accomplish this by using
SELECT FiscalYear, SUM(Col1) AS Col1Total, SUM(Col2) AS Col2Total
FROM TableName
GROUP BY FiscalYear
If your data is not numerical in nature, you can use CASE statements
SELECT FiscalYear, SUM(CASE WHEN ColA = 'abc' THEN 1 ELSE 0 END) AS ColATotal,
SUM(CASE WHEN ColB = 'xyz' THEN 1 ELSE 0 END) AS ColBTotal
FROM TableName
GROUP BY FiscalYear
Hope this helps!