HIVE: Replace empty results with 0 in GROUP BY statements

I'm a new Hive user, and need to aggregate the sum of amounts for a given table. Consider the simplified example below:
SELECT day, sum(amount) FROM tableX WHERE columnA = 'RareValue' GROUP BY day;
For some dates there may be no row that matches the condition in the WHERE clause, so the query result skips those days.
For example, this is the result I get:
date amount
2018-01-15 230
2018-01-13 210
2018-01-12 140
2018-01-11 222
But this is the desired result:
date amount
2018-01-15 230
2018-01-14 0
2018-01-13 210
2018-01-12 140
2018-01-11 222
I tried generating a sequence of dates and then using LEFT JOIN and COALESCE to fill the empty dates with zeros. However, the performance was terribly slow. What is the best approach for this?

Supposing that you are trying to zero out the whole day when your WHERE condition is true for any of its rows, you can do something like:
select
  day,
  if(max(mycondition) = 0, sum(amount), 0) as mysum
from (
  select day, amount,
    if(columnA = 'RareValue', 1, 0) as mycondition
  from tableX
) t
group by day;
I did not have the chance to test it :)

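If I understood correctly, all the days you need are present in tableX. So I advise first selecting the days of the rows where columnA is not equal to 'RareValue' with an amount of 0, then combining them with the matching rows using UNION ALL and summing per day:
SELECT day, sum(amount) AS amount FROM (
SELECT DISTINCT day, 0 AS amount FROM tableX WHERE columnA != 'RareValue'
UNION ALL
SELECT day, amount FROM tableX WHERE columnA = 'RareValue'
) t GROUP BY day;
The outer GROUP BY is needed because a plain UNION of the zero rows with the grouped sums would return a day twice (once with 0, once with its sum) whenever that day has both matching and non-matching rows. The DISTINCT keeps repeated days from the first select from inflating the intermediate result.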
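If every needed day does appear somewhere in tableX, as the answer above assumes, the filter can also be moved out of the WHERE clause and into the aggregate, so that days without a match still produce a row. Below is a minimal sketch of that idea. It runs the SQL through Python's built-in sqlite3 module purely so it can be executed anywhere; the sample rows and the helper name daily_totals are made up for illustration. The CASE WHEN / SUM / GROUP BY pattern itself is plain SQL that Hive also accepts, but I have not measured how it performs on a large Hive table.

```python
import sqlite3


def daily_totals(rows):
    """Return (day, total) per day, with 0 for days that have no 'RareValue' rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tableX (day TEXT, columnA TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO tableX VALUES (?, ?, ?)", rows)
    query = """
        SELECT day,
               SUM(CASE WHEN columnA = 'RareValue' THEN amount ELSE 0 END) AS amount
        FROM tableX
        GROUP BY day
        ORDER BY day DESC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical sample data: 2018-01-14 has rows, but none with 'RareValue'.
sample_rows = [
    ("2018-01-15", "RareValue", 230),
    ("2018-01-14", "Other", 50),
    ("2018-01-13", "RareValue", 210),
    ("2018-01-13", "Other", 5),
]
print(daily_totals(sample_rows))
```

The trade-off is that the query now reads every row of the table instead of only the rare ones, and a day with no rows at all in tableX would still be missing.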

Related

count based on store_A and all stores

I would like to count:
all count_products for store_A
the total count_products which includes all store_id's even store_A as total_count_products
Main_table
date        store_id  count_products
2019-01-01  A         13
2019-01-01  B         34
2019-01-01  C         63
2019-01-01  D         10
Output_table
date        store_A_count_products  total_count_products
2019-01-01  13                      120
Start out by selecting the date column without any modifications.
For store_A_count_products, basically what you need to do is add up all of the count_products whenever the store_id is A. You can do this with a case statement:
case when store_id = 'A' then count_products else 0 end
This is basically an IF/ELSE situation and will return a 0 for any row that doesn't have A in the store_id column.
If you wrap that up in a SUM(), you will add all the rows together.
For total_count_products, you just need to wrap a SUM() around count_products. This will add up all rows regardless of the status of any other column.
Finally, you need to group by the date column. The group by is a means to split the aggregated data across unaggregated columns.
This works because it gives you one row for each date, with the summed total of products for store A and the summed total of all products.
SELECT
  date,
  SUM(CASE WHEN store_id = 'A' THEN count_products ELSE 0 END) AS store_A_count_products,
  SUM(count_products) AS total_count_products
FROM main_table
GROUP BY date;
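To check the query against the sample table from the question, here is a small sketch that loads the four rows into an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in so the example is runnable; the question does not say which database is in use, and the helper name store_counts is made up.

```python
import sqlite3


def store_counts(rows):
    """Return (date, store A total, overall total) per date."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE main_table (date TEXT, store_id TEXT, count_products INTEGER)"
    )
    conn.executemany("INSERT INTO main_table VALUES (?, ?, ?)", rows)
    query = """
        SELECT date,
               SUM(CASE WHEN store_id = 'A' THEN count_products ELSE 0 END)
                   AS store_A_count_products,
               SUM(count_products) AS total_count_products
        FROM main_table
        GROUP BY date
        ORDER BY date
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# The rows from Main_table in the question.
main_rows = [
    ("2019-01-01", "A", 13),
    ("2019-01-01", "B", 34),
    ("2019-01-01", "C", 63),
    ("2019-01-01", "D", 10),
]
print(store_counts(main_rows))
```

For the sample rows this gives a single row with 13 for store A and 120 overall, which matches Output_table.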

how to add monthly count average

I am looking for all counts where DimSystemID = -1, and I also want to add a new column that calculates the average per month. Below are my current query and result; I don't know how to add the new column.
query:
select DimSystemID, EligibleYM, count(*)
from dbo.table1
where DimSystemID=-1
group by DimSystemID, EligibleYM
order by 2 desc, 1
Result table
DimSystemID EligibleYM (No column name)
-1 202001 75
-1 201912 70
-1 201911 67
-1 201910 67
-1 201909 59
Welcome to Stack. Assuming that your data set has some values you want to average that are not shown in your question, in MS SQL you would just add another computed column that does the math:
select DimSystemID, EligibleYM, count(*), [new computed column here as AVG]
from dbo.table1
where DimSystemID=-1
group by DimSystemID, EligibleYM
order by 2 desc, 1
with an example:
select DimSystemID, EligibleYM, count(*), AVG(MONTH DATA HERE)
An example (anonymized) of your data would help.
MSSQL AVG Document

Need sum of a column from a filter condition for each row

I need to get the total sum of Defect between the MAIN_DT column and the 365 days (a year) before it, if any, for a single Id.
The value needs to be populated for each row.
I have tried the queries below, and also tried to use CSUM, but they are not working:
1) select sum(Defect) as "sum",Id,MAIN_DT
from check_diff
where MAIN_DT between ADD_MONTHS(MAIN_DT,-12) and MAIN_DT group by 2,3;
2)select Defect,
Type1,
Type2,
Id,
MAIN_DT,
ADD_MONTHS(TIM_MAIN_DT,-12) year_old,
CSUM(Defect,MAIN_DT)
from check_diff
where
MAIN_DT between ADD_MONTHS(MAIN_DT,-12) and MAIN_DT group by id;
The expected output is as below:
Defect Type1 Type2 Id main_dt sum
1 a a 1 3/10/2017 1
99 a a 1 4/10/2018 99
0 a b 1 7/26/2018 99
1 a b 1 11/21/2018 100
1 a c 2 12/20/2018 1
Teradata doesn't support RANGE for cumulative sums, but you can rewrite it using a correlated scalar subquery:
select Defect, Id, MAIN_DT,
( select sum(Defect) as "sum"
from check_diff as t2
where t2.Id = t1.Id
and t2.MAIN_DT > ADD_MONTHS(t1.MAIN_DT,-12)
and t2.MAIN_DT <= t1.MAIN_DT
) as dt
from check_diff as t1
Performance might be bad depending on the overall number of rows and the number of rows per ID.
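The correlated-subquery idea can be tried out on the five rows from the expected output. The sketch below uses Python's sqlite3 module instead of Teradata, so two things are assumptions of mine: the dates are stored as ISO strings (so plain string comparison orders them correctly), and SQLite's date(x, '-12 months') stands in for ADD_MONTHS(x, -12). The helper name rolling_defect_sums is also made up.

```python
import sqlite3


def rolling_defect_sums(rows):
    """Return (defect, id, main_dt, sum over the preceding 12 months) per row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE check_diff (defect INTEGER, id INTEGER, main_dt TEXT)")
    conn.executemany("INSERT INTO check_diff VALUES (?, ?, ?)", rows)
    query = """
        SELECT t1.defect, t1.id, t1.main_dt,
               (SELECT SUM(t2.defect)
                FROM check_diff AS t2
                WHERE t2.id = t1.id
                  AND t2.main_dt > date(t1.main_dt, '-12 months')
                  AND t2.main_dt <= t1.main_dt) AS rolling_sum
        FROM check_diff AS t1
        ORDER BY t1.id, t1.main_dt
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Rows from the question's expected output, with dates rewritten as YYYY-MM-DD.
defect_rows = [
    (1, 1, "2017-03-10"),
    (99, 1, "2018-04-10"),
    (0, 1, "2018-07-26"),
    (1, 1, "2018-11-21"),
    (1, 2, "2018-12-20"),
]
for row in rolling_defect_sums(defect_rows):
    print(row)
```

The rolling sums come out as 1, 99, 99, 100 for Id 1 and 1 for Id 2, the same as the sum column in the expected output.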

Table not aggregating properly

I am trying to create a list of percentages from a dataset of transactional data using SAS/SQL to understand how a specific department contributes to overall sales count for a given quarter. For example, if there were 100 sales of Store ID 234980 and 20 of those were in department a in Q4 of 2006, then the list should output:
Store ID 234980 , 20%.
This is the code I am using to achieve this result.
data testdata;
set work.dataset;
format PostingDate yyq.;
run;
PROC SQL;
CREATE TABLE aggregatedata AS
SELECT DISTINCT testdata.ID,
SUM(CASE
WHEN testdata.Store='A' THEN 1 ELSE 0
END)/COUNT(Store) as PERCENT,
PostingDate
FROM work.testdata
group by testdata.ID, testdata.PostingDate;
QUIT;
However, the output I am receiving is more like this:
StoreID DepartmentA Quarter
100 1 2014Q1
100 0 2014Q2
100 1 2014Q2
100 0 2014Q2
100 0 2014Q2
100 0 2014Q2
101 1 2015Q3
101 0 2015Q3
101 0 2015Q4
Why does my code not aggregate to the store level?
If you want to group by QTR then you need to transform your date values into quarter values. Otherwise '01JAN2017'd and '01FEB2017'd would be seen as two distinct values even though they would both display the same using the YYQ. format.
proc sql;
create table aggregatedata as
select id
, intnx('qtr',postingdate,0,'b') as postingdate format=yyq.
, sum(store='A')/count(store) as percent
from work.testdata
group by 1,2
;
quit;
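The effect of grouping on a derived quarter value rather than on the raw date can be reproduced outside SAS. The sketch below is only an illustration using Python's sqlite3 module: SQLite has no INTNX or YYQ. format, so the quarter label is built from strftime, and the sample rows and the helper name percent_by_quarter are invented. The 1.0 factor is there because SQLite would otherwise do integer division.

```python
import sqlite3


def percent_by_quarter(rows):
    """Return (id, quarter label, share of rows with store 'A') per id and quarter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE testdata (id INTEGER, store TEXT, postingdate TEXT)")
    conn.executemany("INSERT INTO testdata VALUES (?, ?, ?)", rows)
    query = """
        SELECT id,
               strftime('%Y', postingdate) || 'Q' ||
                   ((CAST(strftime('%m', postingdate) AS INTEGER) + 2) / 3) AS quarter,
               1.0 * SUM(store = 'A') / COUNT(store) AS percent
        FROM testdata
        GROUP BY id, quarter
        ORDER BY id, quarter
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical transactions: six different dates that fall into two quarters.
sales_rows = [
    (100, "A", "2014-01-15"),
    (100, "B", "2014-02-20"),
    (100, "A", "2014-04-03"),
    (100, "B", "2014-05-09"),
    (100, "B", "2014-06-29"),
    (100, "B", "2014-06-30"),
]
print(percent_by_quarter(sales_rows))
```

Grouping on postingdate directly would return six rows here, one per distinct date; grouping on the derived quarter returns two.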
You do not want to use both DISTINCT and GROUP BY; GROUP BY already returns one row per group.
Perhaps try:
select t.testingdate
,t.StoreID
,t.Department
,count(*) / (select count(*)
from testdata t2
where t.testingdate = t2.testingdate
and t.StoreID = t2.StoreID) AS Percentage
from testdata t
group by t.testingdate
,t.StoreID
,t.Department
Alternatively, you could use a left join, which may be more efficient, but the nested select that counts all records regardless of department may be clearer to read.

Get a single max date if dates are not unique

This is for SQL Server 2000, and very similar to what I asked in "Get distinct max date using SQL".
But this time the dates aren't unique, so for this table, pc_bsprdt_tbl:
pc_bsprhd_key  pc_bsprdt_shpiadt        pc_bsprdt_prod
21ST 99-00     2001-04-30 23:59:59.000  72608-12895
21ST 99-00     2001-04-30 23:59:59.000  72608-12910
AFCC990915     1999-09-01 00:00:00.000  72608-12115
AFCC990915     1999-09-01 00:00:00.000  CHU99-01514
AFCC990915     1999-09-01 00:00:00.000  POP99-01514
I would like returned
21ST 99-00 2001-04-30 23:59:59.000
AFCC990915 1999-09-01 00:00:00.000
Now, the pc_bsprdt_prod is unique so what I have tried is using the max for the product like this to give me uniqueness.
Select T.pc_bsprhd_key, T.pc_bsprdt_shpiadt
From pc_bsprdt_tbl As T
Join (
Select pc_bsprhd_key, Max( T1.pc_bsprdt_shpiadt ) As MaxDateTime, Max(pc_bsprdt_prod) as Product
From pc_bsprdt_tbl As T1
Group By T1.pc_bsprhd_key
) As Z
On Z.pc_bsprhd_key = T.pc_bsprhd_key
And Z.MaxDateTime = T.pc_bsprdt_shpiadt
AND Z.Product = T.pc_bsprdt_prod
It seems like it works :)
Is there a way to do it though just using the date? Maybe a top 1 in there somewhere?
SELECT pc_bsprhd_key, MAX(pc_bsprdt_shpiadt)
FROM pc_bsprdt_tbl
GROUP BY pc_bsprhd_key;
That might not be working as you think it is. That will give you the MAX(Date) and MAX(prod) which might not be on the same row. Here is an example:
CREATE TABLE #Test
(
a int,
b date,
c int
)
INSERT INTO #Test(a, b, c)
SELECT 1, '01/01/2010', 3 UNION ALL
SELECT 1, '01/02/2010', 2 UNION ALL
SELECT 1, '01/03/2010', 1 UNION ALL
SELECT 2, '01/01/2010', 1
SELECT a, MAX(b), MAX(c) FROM #TEST
GROUP BY a
Which will return
----------- ---------- -----------
1 2010-01-03 3
2 2010-01-01 1
Notice that 1/03/2010 and 3 are not in the same row. In this situation I don't think it matters to you, but just a heads up.
As for the actual question: in SQL Server 2005 we would probably apply ROW_NUMBER over the groups to get the row with the latest date for each part; however, you don't have access to this feature in 2000. If the above is giving you correct results, I'd say use it.
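The difference between taking MAX of each column separately and picking values from the row that holds the latest date can be seen with the #Test data above. This is a sketch using Python's sqlite3 module rather than SQL Server 2000, with the dates written as ISO strings and an ordinary table named test in place of the temp table; the helper name compare_max_queries is made up. The second query joins back on the per-group maximum date, which is the same pattern the question already uses, just without the product column.

```python
import sqlite3


def compare_max_queries(rows):
    """Return (independent per-column maxima, rows that hold the latest date)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (a INTEGER, b TEXT, c INTEGER)")
    conn.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)
    # MAX(b) and MAX(c) are computed independently of each other.
    independent = conn.execute(
        "SELECT a, MAX(b), MAX(c) FROM test GROUP BY a ORDER BY a"
    ).fetchall()
    # Joining back on the group's maximum date returns c from that same row.
    same_row = conn.execute(
        """
        SELECT t.a, t.b, t.c
        FROM test AS t
        JOIN (SELECT a, MAX(b) AS max_b FROM test GROUP BY a) AS z
          ON z.a = t.a AND z.max_b = t.b
        ORDER BY t.a
        """
    ).fetchall()
    conn.close()
    return independent, same_row


# The rows inserted into #Test above, with dates as YYYY-MM-DD.
test_rows = [
    (1, "2010-01-01", 3),
    (1, "2010-01-02", 2),
    (1, "2010-01-03", 1),
    (2, "2010-01-01", 1),
]
independent, same_row = compare_max_queries(test_rows)
print(independent)
print(same_row)
```

For a = 1 the first query pairs 2010-01-03 with 3, while the second returns 2010-01-03 with 1, the value actually stored on that row. If several rows share the maximum date, the join returns all of them, which is the non-unique case the question is about.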