How rewrite COUNT DISTINCT?

How rewrite COUNT DISTINCT? - sql

I have a problem with only one reducer in hive, because of using count and distinct in one query.
How to rewrite select to eliminate this? Is it possible in window functions?
select
a.second_id,
if(a.proc_id = 'CONST1' and bb.third_id is not null,
count(distinct bb.first_id),
'') as qty
from a a
join (select
b.first_id,
b.second_id,
b.third_id
from b b) bb
on bb.second_id = a.second_id
group by
a.second_id,
a.proc_id,
bb.third_id;

This is your query:
select a.second_id,
(case when a.proc_id = 'CONST1' and bb.third_id is not null
then count(distinct bb.first_id)
end) as qty
from a join
(select b.first_id, b.second_id, b.third_id
from b
) bb
on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id;
The count(distinct) can really be handled in the subquery, using group by and window functions. I don't see any value to not aggregating first, so:
select a.second_id,
(case when a.proc_id = 'CONST1' and bb.third_id is not null
then max(bb.num_firsts)
end) as qty
from a join
(select b.second_id, b.third_id,
count(distinct first_id) as num_firsts
from b
group by b.second_id, b.third_id
) bb
on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id;
You are aggregating by second_id and third_id in the outer query. So there is only one row from the aggregated subquery in the outer query. The above version uses max(first_id), but you could also include num_firsts in the outer group by.
That still might not fix your problem, but this query is easier to modify. If I recall, the best approach in Hive is a select distinct subquery:
select a.second_id,
(case when a.proc_id = 'CONST1' and bb.third_id is not null
then max(bb.num_firsts)
end) as qty
from a join
(select b.second_id, b.third_id,
count(*) as num_firsts
from (select distinct second_id, third_id, first_id
from b
) b
group by b.second_id, b.third_id
) bb
on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id;
This is the same thing if first_id is never null. This will count that as a separate value; if you don't want to, just filter them out.

Related

Window function issue - max over partition

I try to rewrite such SQL statements (with many subqueries) to more efficient form using outer join and max/count/... over partition. Old statements:
select a.ID,
(select max(b.valA) from something b where a.ID = b.ID_T and b.status != 0),
(select max(b.valB) from something b where a.ID = b.ID_T and b.status != 0),
(select max(b.valC) from something b where a.ID = b.ID_T and b.status != 0),
(select max(b.valD) from something b where a.ID = b.ID_T)
from tool a;
What is important here - there is different condition for max(b.valD). Firstly I didn't noticed this difference and write something like this:
select distinct a.ID,
max(b.valA) over (partition by b.ID_T),
max(b.valB) over (partition by b.ID_T),
max(b.valC) over (partition by b.ID_T),
max(b.valD) over (partition by b.ID_T),
from tool a,
(select * from something
where status != 0) b
where a.ID = b.ID_T(+);
Could I use somewhere in max over partition this condition of b.status != 0 ? Or should I better add 3rd table to join like this:
select distinct a.ID,
max(b.valA) over (partition by b.ID_T),
max(b.valB) over (partition by b.ID_T),
max(b.valC) over (partition by b.ID_T),
max(c.valD) over (partition by c.ID_T),
from tool a,
(select * from something
where status != 0) b,
something c
where a.ID = b.ID_T(+)
and a.ID = c.ID_T(+);
The issue is with selecting and joining millions of rows, my example is just simplification of my query. Could anyone help me to achieve more efficient sql?

You could try to do this using CASE:
select a.ID,
max(CASE WHEN b.status=0 THEN b.valA END),
max(CASE WHEN b.status=0 THEN b.valB END),
max(CASE WHEN b.status=0 THEN b.valC END),
max(b.valD)
from tool a
left join something b ON( b.ID_T = a.ID )
group by a.ID;
Note that I replaced your implicit join by the "new" join-syntax for better readability.

One more way is to use JOIN and group by subquery:
select a.ID,
b.MAX_A,
b.MAX_B,
b.MAX_C,
b2.MAX_D
from tool a
LEFT JOIN
(
SELECT ID_T,max(valA) MAX_A, max(valB) MAX_B, max(valC) MAX_C
FROM something
WHERE status != 0
GROUP BY ID_T
) b
ON a.ID=b.ID_T
LEFT JOIN
(
SELECT ID_T, max(valD) MAX_D
FROM something
GROUP BY ID_T
) b2
ON a.ID=b2.ID_T

Complex Full Outer Join

Sigh ... can anyone help? In the SQL query below, the results I get are incorrect. There are three (3) labor records in [LaborDetail]
Hours / Cost
2.75 / 50.88
2.00 / 74.00
1.25 / 34.69
There are two (2) material records in [WorkOrderInventory]
Material Cost
42.75
35.94
The issue is that the query incorrectly returns the following:
sFunction cntWO sumLaborHours sumLaborCost sumMaterialCost
ROBOT HARNESS 1 12 319.14 236.07
What am I doing wrong in the query that is causing the sums to be multiplied? The correct values are sumLaborHours = 6, sumLaborCost = 159.57, and sumMaterialCost = 78.69. Thank you for your help.
SELECT CASE WHEN COALESCE(work_orders.location, Work_Orders_Archived.location) IS NULL
THEN '' ELSE COALESCE(work_orders.location, Work_Orders_Archived.location) END AS sFunction,
(SELECT COUNT(*)
FROM work_orders
FULL OUTER JOIN Work_Orders_Archived
ON work_orders.order_number = Work_Orders_Archived.order_number
WHERE COALESCE(work_orders.order_number, Work_Orders_Archived.order_number) = '919630') AS cntWO,
SUM(Laborhours) AS sumLaborHours,
SUM(LaborCost) AS sumLaborCost,
SUM(MaterialCost*MaterialQuanity) AS sumMaterialCost
FROM work_orders
FULL OUTER JOIN Work_Orders_Archived
ON work_orders.order_number = Work_Orders_Archived.order_number
LEFT OUTER JOIN
(SELECT HoursWorked AS Laborhours, TotalDollars AS LaborCost, WorkOrderNo
FROM LaborDetail) AS LD
ON COALESCE(work_orders.order_number, Work_Orders_Archived.order_number) = LD.WorkOrderNo
LEFT OUTER JOIN
(SELECT UnitCost AS MaterialCost, Qty AS MaterialQuanity, OrderNumber
FROM WorkOrderInventory) AS WOI
ON COALESCE(work_orders.order_number, Work_Orders_Archived.order_number) = WOI.OrderNumber
WHERE COALESCE(work_orders.order_number, Work_Orders_Archived.order_number) = '919630'
GROUP BY CASE WHEN COALESCE(work_orders.location, Work_Orders_Archived.location) IS NULL
THEN '' ELSE COALESCE(work_orders.location, Work_Orders_Archived.location) END
ORDER BY sFunction

Try using the SUM function inside a derived table subquery when doing the full join to "WorkOrderInventory" like so...
select
...
sum(hrs) as sumlaborhrs,
sum(cost) as sumlaborcost,
-- calculate material cost in subquery
summaterialcost
from labordetail a
full outer join
(select ordernumber, sum(materialcost) as summaterialcost
from WorkOrderInventory
group by ordernumber
) b on a.workorderno = b.ordernumber
i created a simple sql fiddle to demonstrate this (i simplified your query for examples sake)

Looks to me that work_orders and work_orders_archived contains the same thing and you need both tables as if they were one table. So you could instead of joining create a UNION and use it as if it was one table:
select location as sfunction
from
(select location
from work_orders
union location
from work_orders_archived)
Then you use it to join the rest. What DBMS are you on? You could use WITH. But this does not exist on MYSQL.
with wo as
(select location as sfunction, order_number
from work_orders
union location, order_number
from work_orders_archived)
select sfunction,
count(*)
SUM(Laborhours) AS sumLaborHours,
SUM(LaborCost) AS sumLaborCost,
SUM(MaterialCost*MaterialQuanity) AS sumMaterialCost
from wo
LEFT OUTER JOIN
(SELECT HoursWorked AS Laborhours, TotalDollars AS LaborCost, WorkOrderNo
FROM LaborDetail) AS LD
ON COALESCE(work_orders.order_number, Work_Orders_Archived.order_number) = LD.WorkOrderNo
LEFT OUTER JOIN
(SELECT UnitCost AS MaterialCost, Qty AS MaterialQuanity, OrderNumber
FROM WorkOrderInventory) AS WOI
ON COALESCE(work_orders.order_number, Work_Orders_Archived.order_number) = WOI.OrderNumber
where wo.order_number = '919630'
group by sfunction
order by sfunction

The best guess is that the work orders appear more than once in one of the tables. Try these queries to check for duplicates in the two most obvious candidate tables:
select cnt, COUNT(*), MIN(order_number), MAX(order_number)
from (select order_number, COUNT(*) as cnt
from work_orders
group by order_number
) t
group by cnt
order by 1;
select cnt, COUNT(*), MIN(order_number), MAX(order_number)
from (select order_number, COUNT(*) as cnt
from work_orders_archived
group by order_number
) t
group by cnt
order by 1;
If either returns a row where cnt is not 1, then you have duplicates in the tables.

Oracle query - how to make count to return values with 0

How can I make a count to return also the values with 0 in it.
Example:
select count(1), equipment_name
from alarms.new_alarms
where equipment_name in (
select eqp from ne_db.ne_list)
Group by equipment_name
It is returning only the counts with values higher than 0 , but I need to know the records that are not returning anything.
Any help is greatly appreciated.
thanks,
Marco

Try using LEFT JOIN,
SELECT a.eqp, COUNT(b.equipment_name) totalCount
FROM ne_db.ne_list a
LEFT JOIN alarms.new_alarms b
ON a.eqp = b.equipment_name
GROUP BY a.eqp

If the table ne_list has no duplicates, then you can do a left join. That assumption may not be true, so the safest way to convert this is by removing duplicates in a subquery:
select count(1), ne.equipment_name
from alarms.new_alarms ne left outer join
(select distinct eqp
from ne_db.ne_list
) eqp
on ne.equipment_name = eqp.eqp
Group by ne.equipment_name

You could use a left join:
select ne.equipment_name
, count(na.equipment_name)
from ne_db.ne_list ne
left join
alarms.new_alarms na
on ne.eqp = na.equipment_name
group by
ne.equipment_name

This should work too
(select equipment_name, count(*) as cnt from alarms.new_alarms
where equipment_name in (
select eqp from ne_db.ne_list
) group by equipment_name
)union(
select equipment_name, 0 as cnt from alarms.new_alarms
where equipment_name not in (
select eqp from ne_db.ne_list
) group by equipment_name
) order by equipment_name

Inner join that ignore singlets

I have to do an self join on a table. I am trying to return a list of several columns to see how many of each type of drug test was performed on same day (MM/DD/YYYY) in which there were at least two tests done and at least one of which resulted in a result code of 'UN'.
I am joining other tables to get the information as below. The problem is I do not quite understand how to exclude someone who has a single result row in which they did have a 'UN' result on a day but did not have any other tests that day.
Query Results (Columns)
County, DrugTestID, ID, Name, CollectionDate, DrugTestType, Results, Count(DrugTestType)
I have several rows for ID 12345 which are correct. But ID 12346 is a single row of which is showing they had a row result of count (1). They had a result of 'UN' on this day but they did not have any other tests that day. I want to exclude this.
I tried the following query
select
c.desc as 'County',
dt.pid as 'PID',
dt.id as 'DrugTestID',
p.id as 'ID',
bio.FullName as 'Participant',
CONVERT(varchar, dt.CollectionDate, 101) as 'CollectionDate',
dtt.desc as 'Drug Test Type',
dt.result as Result,
COUNT(dt.dru_drug_test_type) as 'Count Of Test Type'
from
dbo.Test as dt with (nolock)
join dbo.History as h on dt.pid = h.id
join dbo.Participant as p on h.pid = p.id
join BioData as bio on bio.id = p.id
join County as c with (nolock) on p.CountyCode = c.code
join DrugTestType as dtt with (nolock) on dt.DrugTestType = dtt.code
inner join
(
select distinct
dt2.pid,
CONVERT(varchar, dt2.CollectionDate, 101) as 'CollectionDate'
from
dbo.DrugTest as dt2 with (nolock)
join dbo.History as h2 on dt2.pid = h2.id
join dbo.Participant as p2 on h2.pid = p2.id
where
dt2.result = 'UN'
and dt2.CollectionDate between '11-01-2011' and '10-31-2012'
and p2.DrugCourtType = 'AD'
) as derived
on dt.pid = derived.pid
and convert(varchar, dt.CollectionDate, 101) = convert(varchar, derived.CollectionDate, 101)
group by
c.desc, dt.pid, p.id, dt.id, bio.fullname, dt.CollectionDate, dtt.desc, dt.result
order by
c.desc ASC, Participant ASC, dt.CollectionDate ASC

This is a little complicated because the your query has a separate row for each test. You need to use window/analytic functions to get the information you want. These allow you to do calculate aggregation functions, but to put the values on each line.
The following query starts with your query. It then calculates the number of UN results on each date for each participant and the total number of tests. It applies the appropriate filter to get what you want:
with base as (<your query here>)
select b.*
from (select b.*,
sum(isUN) over (partition by Participant, CollectionDate) as NumUNs,
count(*) over (partition by Partitipant, CollectionDate) as NumTests
from (select b.*,
(case when result = 'UN' then 1 else 0 end) as IsUN
from base
) b
) b
where NumUNs <> 1 or NumTests <> 1
Without the with clause or window functions, you can create a particularly ugly query to do the same thing:
select b.*
from (<your query>) b join
(select Participant, CollectionDate, count(*) as NumTests,
sum(case when result = 'UN' then 1 else 0 end) as NumUNs
from (<your query>) b
group by Participant, CollectionDate
) bsum
on b.Participant = bsum.Participant and
b.CollectionDate = bsum.CollectionDate
where NumUNs <> 1 or NumTests <> 1

If I understand the problem, the basic pattern for this sort of query is simply to include negating or exclusionary conditions in your join. I.E., self-join where columnA matches, but columns B and C do not:
select
[columns]
from
table t1
join table t2 on (
t1.NonPkId = t2.NonPkId
and t1.PkId != t2.PkId
and t1.category != t2.category
)
Put the conditions in the WHERE clause if it benchmarks better:
select
[columns]
from
table t1
join table t2 on (
t1.NonPkId = t2.NonPkId
)
where
t1.PkId != t2.PkId
and t1.category != t2.category
And it's often easiest to start with the self-join, treating it as a "base table" on which to join all related information:
select
[columns]
from
(select
[columns]
from
table t1
join table t2 on (
t1.NonPkId = t2.NonPkId
)
where
t1.PkId != t2.PkId
and t1.category != t2.category
) bt
join [othertable] on (<whatever>)
join [othertable] on (<whatever>)
join [othertable] on (<whatever>)
This can allow you to focus on getting that self-join right, without interference from other tables.

Select all data not in Top 'n' as 'Other'

I hope somebody may be able to point out where i'm going wrong here but i've been looking at this for the last 30 minutes and not gotten anywhere with it.
I have a temporary table that is populated with data, the front end application cannot do any logic for me so please excuse the ugly case statement logic in the table.
The user is happy with the resultset brought back as I get the top 10 records. They have now decided they want to see a group of the remaining countries (all rows not in the top 10) as 'Other'.
I have tried to create a grouping of countries not in the top 10 but it's not working, I was planning on UNION'ing this result to the top 10 results.
SELECT c.Country, count(*) AS 'Total_Number_of_customers', COALESCE(ili.new_customers,0) AS 'New_Customers', COALESCE(ilb.existing_first,0) AS 'Existing_First_Trans', COALESCE(ilc.existing_old,0) AS 'Existing_Prev_Trans'
FROM #customer_tmp c
LEFT JOIN (SELECT z.country, count(*) AS 'new_customers' FROM #customer_tmp z where z.customer_type='New_Customer' group by z.country)ili ON ili.country = c.country
LEFT JOIN (SELECT zy.country, count(*) AS 'existing_first' FROM #customer_tmp zy where zy.customer_type='Existing_Customer' AND zy.first_transaction=1 group by zy.country)ilb ON ilb.country = c.country
LEFT JOIN (SELECT zx.country, count(*) AS 'existing_old' FROM #customer_tmp zx where zx.customer_type='Existing_Customer' AND zx.first_transaction=0 group by zx.country)ilc ON ilc.country = c.country
GROUP BY c.country, ili.new_customers, ilb.existing_first, ilc.existing_old
ORDER BY 2 DESC
Here is the SQL that I use to get results from my table.
For reference, each row in my temporary table contains a customer ID, the date they were created and their customer type, which is specific to what i'm trying to achieve.
Hopefully this is a simple problem and i'm just being a bit slow..
Many thanks in Adv.

Use the EXCEPT operator in SQL Server:
SELECT <fields>
FROM <table>
WHERE <conditons>
EXCEPT
<Query you want excluded>

Yup; EXCEPT, or maybe add a row number to your query and then select by that:
SELECT * FROM (
SELECT c.Country, count(*) AS 'Total_Number_of_customers',
row_number() OVER (ORDER BY COUNT(*) DESC) AS 'r',
COALESCE(ili.new_customers,0) AS 'New_Customers', COALESCE(ilb.existing_first,0) AS 'Existing_First_Trans', COALESCE(ilc.existing_old,0) AS 'Existing_Prev_Trans'
FROM #customer_tmp c
LEFT JOIN (SELECT z.country, count(*) AS 'new_customers' FROM #customer_tmp z where z.customer_type='New_Customer' group by z.country)ili ON ili.country = c.country
LEFT JOIN (SELECT zy.country, count(*) AS 'existing_first' FROM #customer_tmp zy where zy.customer_type='Existing_Customer' AND zy.first_transaction=1 group by zy.country)ilb ON ilb.country = c.country
LEFT JOIN (SELECT zx.country, count(*) AS 'existing_old' FROM #customer_tmp zx where zx.customer_type='Existing_Customer' AND zx.first_transaction=0 group by zx.country)ilc ON ilc.country = c.country
GROUP BY c.country, ili.new_customers, ilb.existing_first, ilc.existing_old
ORDER BY 2 DESC
) sub_query WHERE sub_query.r >= 10
This may be more flexible, as you can run one query and then divide the results up into "top ten" and "the rest" quite easily.
(This is equivalent to bobs' answer; I guess we were working on this at exactly the same time!)

Here's an approach using common table expressions (CTE)
WITH CTE AS
(
SELECT c.Country, count(*) AS 'Total_Number_of_customers', COALESCE(ili.new_customers,0) AS 'New_Customers', COALESCE(ilb.existing_first,0) AS 'Existing_First_Trans', COALESCE(ilc.existing_old,0) AS 'Existing_Prev_Trans'
, ROW_NUMBER() OVER (ORDER BY count(*) DESC) AS sequence
FROM #customer_tmp c
LEFT JOIN (SELECT z.country, count(*) AS 'new_customers' FROM #customer_tmp z where z.customer_type='New_Customer' group by z.country)ili ON ili.country = c.country
LEFT JOIN (SELECT zy.country, count(*) AS 'existing_first' FROM #customer_tmp zy where zy.customer_type='Existing_Customer' AND zy.first_transaction=1 group by zy.country)ilb ON ilb.country = c.country
LEFT JOIN (SELECT zx.country, count(*) AS 'existing_old' FROM #customer_tmp zx where zx.customer_type='Existing_Customer' AND zx.first_transaction=0 group by zx.country)ilc ON ilc.country = c.country
GROUP BY c.country, ili.new_customers, ilb.existing_first, ilc.existing_old
)
SELECT *
FROM CTE
WHERE sequence > 10
ORDER BY sequence

SELECT country, COUNT(*) cnt, SUM(new_customer), SUM(existing_first_trans), SUM(existing_prev_trans)
FROM (
SELECT CASE
WHEN country IN
(
SELECT TOP 10 country
FROM #customer_tmp
ORDER BY
COUNT(*) DESC
) THEN
country
ELSE 'Others'
END AS country,
CASE WHEN customer_type = 'New_Customer' THEN 1 END AS new_customer,
CASE WHEN customer_type = 'Existing_Customer' AND first_transaction = 1 THEN 1 AS existing_first_trans,
CASE WHEN customer_type = 'Existing_Customer' AND first_transaction = 0 THEN 1 AS existing_prev_trans,
FROM #customer_tmp
)
GROUP BY
country
ORDER BY
CASE country WHEN 'Others' THEN 2 ELSE 1 END, cnt DESC

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How rewrite COUNT DISTINCT? - sql

Related

Window function issue - max over partition

Complex Full Outer Join

Oracle query - how to make count to return values with 0

Inner join that ignore singlets

Select all data not in Top 'n' as 'Other'

Categories

Resources