Grouping over multiple columns and counting distinct over different groups - sql

Given this data
month id
1     x
1     x
1     y
2     z
2     x
2     y
My output should be
month distinct_id total_id
1     2           3
2     3           3
How can I achieve this in a single query?
I tried this query
SELECT TO_CHAR(DOCDATE,'MON') MON
,COUNT(DISTINCT T.MOB_MTCHED_LYLTY_ID) OVER() SHARE
from data
group by 1
but this is giving me an error

select month,
count(distinct id) distinct_id,
count(id) total_id
from data
group by month;

SELECT [Month], COUNT(DISTINCT id) as dist_id, COUNT(id) as count_id
FROM data
GROUP BY Month
A few further remarks:
About your code: don't use OVER when it isn't necessary.
Don't post pictures of your data as you did here; a small table in the question text is better.
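To try the grouped query above without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory table and its sample rows are copied from the question; using SQLite as the engine is an assumption made only for illustration.

```python
import sqlite3

# In-memory SQLite database; the table name "data" follows the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (month INTEGER, id TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(1, "x"), (1, "x"), (1, "y"), (2, "z"), (2, "x"), (2, "y")],
)

# One GROUP BY is enough: COUNT(DISTINCT id) and COUNT(id) are both
# evaluated per month, so no window function is needed.
rows = conn.execute(
    """
    SELECT month,
           COUNT(DISTINCT id) AS distinct_id,
           COUNT(id)          AS total_id
    FROM data
    GROUP BY month
    ORDER BY month
    """
).fetchall()

print(rows)  # [(1, 2, 3), (2, 3, 3)]
```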

Related

Checking conditions per group, and ranking most recent row?

I'm handling a table like so:
Name   Status Date
Alfred 1      Jan 1 2023
Alfred 2      Jan 2 2023
Alfred 3      Jan 2 2023
Alfred 4      Jan 3 2023
Bob    1      Jan 1 2023
Bob    3      Jan 2 2023
Carl   1      Jan 5 2023
Dan    1      Jan 8 2023
Dan    2      Jan 9 2023
I'm trying to set up a query that handles the following:
I'd like to pull the most recent status per Name,
SELECT MAX(Date), Status, Name
FROM test_table
GROUP BY Status, Name
Additionally I'd like in the same query to be able to pull if the user has ever had a status of 2, regardless of if the most recent one is 2 or not
WITH has_2_table AS (
SELECT DISTINCT Name, TRUE as has_2
FROM test_table
WHERE Status = 2 )
And then maybe joining the above on a left join on Name?
But having these as two separate queries and joining them feels clunky to me, especially since I'd like to add additional columns and other checks. Is there a better way to set this up in a single query, or is this the most efficient way?
You said, "I'd like to add additional columns" so I interpret that to mean you would like to Select the entire most recent record and add an 'ever-2' column.
You can either do this by joining two queries, or use window functions. Not knowing Snowflake Cloud Data, I cannot tell you which is more efficient.
Join 2 Queries
Select A.*, Coalesce(B.Ever2, 'No') as Ever2
From (
Select * From test_table x
Where date=(Select max(date) From test_table y
Where x.name=y.name)
) A Left Outer Join (
Select name, 'Yes' as Ever2 From test_table
Where status=2
Group By name
) B On A.name=B.name
The first subquery can also be written as an Inner Join if correlated subqueries are implemented badly on your platform.
Use Window Functions
Select * From (
Select row_number() Over (Partition By name Order By date desc, status desc) as bestrow,
A.*,
Coalesce(max(Case When status=2 Then 'Yes' End) Over (Partition By name Rows Between Unbounded Preceding And Unbounded Following), 'No') as Ever2
From test_table A
)
Where bestrow=1
This second query type always reads and sorts the entire test_table so it might not be the most efficient.
Given that you have a different partitioning on the two aggregations, you could try going with window functions instead:
SELECT DISTINCT Name,
MAX(Date) OVER(
PARTITION BY Name, Status
) AS lastdate,
MAX(CASE WHEN Status = 2 THEN 1 ELSE 0 END) OVER(
PARTITION BY Name
) AS status2
FROM tab
I'd like to pull the most recent status per name […] Additionally I'd like in the same query to be able to pull if the user has ever had a status of 2.
Snowflake has sophisticated aggregate functions.
Using group by, we can get the latest status with arrays and check for a given status with boolean aggregation:
select name, max(date) max_date,
get(array_agg(status) within group (order by date desc), 0) last_status,
boolor_agg(status = 2) has_status2
from mytable
group by name
We could also use window functions and qualify:
select name, date as max_date,
status as last_status,
boolor_agg(status = 2) over(partition by name) has_status2
from mytable
qualify rank() over(partition by name order by date desc) = 1
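The window-function idea from the answers above can be tried locally. What follows is only a sketch: it runs on SQLite through Python's sqlite3 module rather than on Snowflake, so it uses portable syntax (ROW_NUMBER plus MAX(CASE ...) OVER) instead of boolor_agg or qualify. It assumes an SQLite build with window-function support (3.25 or later) and stores the dates as ISO strings so they sort correctly; the 0/1 column name ever2 is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (name TEXT, status INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?)",
    [
        ("Alfred", 1, "2023-01-01"), ("Alfred", 2, "2023-01-02"),
        ("Alfred", 3, "2023-01-02"), ("Alfred", 4, "2023-01-03"),
        ("Bob", 1, "2023-01-01"), ("Bob", 3, "2023-01-02"),
        ("Carl", 1, "2023-01-05"),
        ("Dan", 1, "2023-01-08"), ("Dan", 2, "2023-01-09"),
    ],
)

# bestrow = 1 marks the most recent row per name (ties broken by status).
# ever2 is computed over the whole partition, so it is 1 whenever any
# row for that name has status 2, whether or not it is the latest row.
rows = conn.execute(
    """
    SELECT name, status, date, ever2
    FROM (
        SELECT name, status, date,
               ROW_NUMBER() OVER (PARTITION BY name
                                  ORDER BY date DESC, status DESC) AS bestrow,
               MAX(CASE WHEN status = 2 THEN 1 ELSE 0 END)
                   OVER (PARTITION BY name) AS ever2
        FROM test_table
    ) AS ranked
    WHERE bestrow = 1
    ORDER BY name
    """
).fetchall()

for row in rows:
    print(row)
```

Further checks (for example "ever had status 3") can be added as additional MAX(CASE ...) OVER (PARTITION BY name) columns in the inner query.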

Sum over period

I have some doubts regarding a sum of rows. I have the following dataset in Teradata SQL Assistant:
id period avg_amt flag
111 1 123.5 1
211 1 143.1 1
311 2 122.1 1
411 3 214.5 1
511 3 124.6 0
611 3 153.2 1
I would like to sum the flags based on the period.
What I tried is to use the sum function over the period in two different ways:
select
id, period, avg_amt, flag, sum(flag) over (partition by id order by period)
from dataset
and
select
id, period, avg_amt, flag, sum(flag)
group by id, period, avg_amt, flag
from dataset
The output does not return what I should expect, i.e. for period 1 sum=3, period 2 sum 1, period 3 sum 2.
Could you please tell me what is wrong? Thanks
To get the simple sum:
select period, sum(flag) total_flag
from dataset
group by period
In SQL Server, to bring back the rest of the columns, you can compute the totals in a subquery and join it to the original table:
select id, dataset.period, avg_amt, flag, total_flag
from dataset
inner join (
select period, sum(flag) total_flag
from dataset
group by period
) TF on TF.period=dataset.period
I hope this is still good with teradata-sql-assistant.
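As an alternative to the join, the per-period total can be attached to every row with a window function partitioned by period rather than by id (partitioning by id gives one partition per row here, since each id is unique). The sketch below runs on SQLite through Python's sqlite3 module, which is an assumption made for illustration; Teradata syntax may differ in details. Note that with the six sample rows exactly as listed in the question, period 1 sums to 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset (id INTEGER, period INTEGER, avg_amt REAL, flag INTEGER)"
)
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?, ?)",
    [
        (111, 1, 123.5, 1), (211, 1, 143.1, 1), (311, 2, 122.1, 1),
        (411, 3, 214.5, 1), (511, 3, 124.6, 0), (611, 3, 153.2, 1),
    ],
)

# Every row keeps its own columns and also gets the sum of flag
# over all rows that share its period.
rows = conn.execute(
    """
    SELECT id, period, avg_amt, flag,
           SUM(flag) OVER (PARTITION BY period) AS total_flag
    FROM dataset
    ORDER BY id
    """
).fetchall()

# Collapse to one (period, total_flag) pair per period for a quick check.
totals = sorted({(period, total) for _, period, _, _, total in rows})
print(totals)  # [(1, 2), (2, 1), (3, 2)]
```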

Calculate distinct totals over time

I have the following data:
UniqueID SenderID EntryID Date
1 1 1 2015-09-17
2 1 1 2015-09-23
3 2 1 2015-09-17
4 2 1 2015-09-17
5 3 1 2015-09-17
6 4 1 2015-09-19
7 3 1 2015-09-20
What I require is the following:
3 2015-09-17
4 2015-09-19
4 2015-09-20
4 2015-09-23
The first column is the total number of unique entries up to that date. For example, the entry on 23/9 for Sender 1 and Entry 1 does not increase the total because it duplicates one from 17/9.
How can I do this efficiently, ideally without joining the table to itself? A self-join ends up as a very large query, which is not practical. I have done something similar in Postgres with OVER(), but unfortunately this isn't available in this setup.
I could also do this in code, which I have, but then everything has to be calculated outside the database and imported back in. With millions of rows this process takes days, and I ideally only have hours.
OVER is ANSI standard functionality available in most databases. What you are counting are starts for users, and you can readily do this with a cumulative sum:
select startdate,
sum(count(*)) over (order by startdate) as CumulativeUniqueCount
from (select senderid, min(date) as startdate
from table t
group by senderid
) t
group by startdate
order by startdate;
This should work in any database that supports window functions, such as Oracle, SQL Server 2012+, Postgres, Teradata, DB2, Hive, Redshift, to name a few.
EDIT:
You need a left join to get all the dates in the data:
select d.date,
sum(count(t.senderid)) over (order by d.date) as CumulativeUniqueCount
from (select distinct date from table t) d left join
(select senderid, min(date) as startdate
from table t
group by senderid
) t
on t.startdate = d.date
group by d.date
order by d.date;
Credit to Gordon Linoff for the basic query. However, it will not return rows for dates that don't increase the cumulative sum.
To get those extra rows, you need to include an additional subquery that lists all the distinct dates from the table. And then you left join with Gordon's query + a few minor tweaks to get the desired result:
select d.SomeDate,
sum(count(t.SenderId)) over (order by d.SomeDate)
from (select distinct SomeDate
from SomeTable) d
left join (select SenderId, min(somedate) as MinDate
from SomeTable
group by SenderId) t
on d.SomeDate = t.MinDate
group by d.SomeDate
order by d.SomeDate;
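The left-join approach can be checked against the sample data with the sketch below, which runs on SQLite through Python's sqlite3 module. The table name entries and the column name EntryDate are hypothetical stand-ins for the question's table and its Date column. To stay portable, the per-date count is computed in a CTE first and the running total is applied in a separate step, instead of nesting count() inside sum() over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (UniqueID INTEGER, SenderID INTEGER, "
    "EntryID INTEGER, EntryDate TEXT)"
)
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "2015-09-17"), (2, 1, 1, "2015-09-23"),
        (3, 2, 1, "2015-09-17"), (4, 2, 1, "2015-09-17"),
        (5, 3, 1, "2015-09-17"), (6, 4, 1, "2015-09-19"),
        (7, 3, 1, "2015-09-20"),
    ],
)

rows = conn.execute(
    """
    WITH first_seen AS (
        -- the first date on which each sender appears
        SELECT SenderID, MIN(EntryDate) AS MinDate
        FROM entries
        GROUP BY SenderID
    ),
    per_date AS (
        -- every date in the data, with the number of senders first seen
        -- on it; COUNT of the joined column yields 0 for unmatched dates
        SELECT d.EntryDate AS EntryDate, COUNT(f.SenderID) AS new_senders
        FROM (SELECT DISTINCT EntryDate FROM entries) AS d
        LEFT JOIN first_seen AS f ON f.MinDate = d.EntryDate
        GROUP BY d.EntryDate
    )
    SELECT SUM(new_senders) OVER (ORDER BY EntryDate) AS cumulative,
           EntryDate
    FROM per_date
    ORDER BY EntryDate
    """
).fetchall()

for row in rows:
    print(row)
```

Like the answers above, this counts distinct senders; if uniqueness should be per (SenderID, EntryID) pair, group by both columns in first_seen.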

Grouping by number of occurrences of a repeatable value in Oracle SQL

Let's assume we have a table like this.
id name value
1 x 12
2 x 23
3 y 47
4 x 18
5 y 29
6 z 45
7 y 67
Doing a normal group by name would yield us
select name,count(*) from table group by name;
name count(*)
x 3
y 3
z 1
I want the reverse, i.e. to count how many names occur a given number of times. I want my output to be
count number_of_names
1     1
3     2
Is it possible to do this in a single query? Another way is to use a temp table, but I don't want to do that.
Thanks
You need one more group by:
select cnt, count(*), min(name), max(name)
from (select name, count(*) as cnt
from table
group by name
) n
group by cnt
order by 1;
I do these types of histogram queries all the time. The min() and max() provide sample data. This is useful to understand outliers and unexpected values.
You can GROUP BY twice, e.g.
with
Names as (
select name as name,
count(1) as cnt
from MyTable
group by name)
select count(1),
cnt
from Names
group by cnt
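The double GROUP BY can be verified on the sample table with a short sketch using Python's sqlite3 module. The table name my_table is a placeholder (table itself is a reserved word), and running on SQLite rather than Oracle is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (1, "x", 12), (2, "x", 23), (3, "y", 47), (4, "x", 18),
        (5, "y", 29), (6, "z", 45), (7, "y", 67),
    ],
)

# Inner query: occurrences per name.
# Outer query: how many names share each occurrence count.
rows = conn.execute(
    """
    SELECT cnt, COUNT(*) AS number_of_names
    FROM (
        SELECT name, COUNT(*) AS cnt
        FROM my_table
        GROUP BY name
    ) AS n
    GROUP BY cnt
    ORDER BY cnt
    """
).fetchall()

print(rows)  # [(1, 1), (3, 2)]
```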

SQL to get distinct statistics

Suppose I have data in table X:
id assign team
----------------------
1 hunkim A
1 ygg A
2 hun B
2 gw B
2 david B
3 haha A
I want to know how many assigns there are for each id. I can get that using:
select id, count(distinct assign) from
X group by id
order by count(distinct assign)desc;
It will give me something:
1 2
2 3
3 1
My question is: how can I get the average of all the assign counts?
In addition, I want to know the average per team, so I want to get something like:
team assign_avg
-------------------
A 1.5
B 3
Thanks in advance!
SELECT
AVG(CAST(assign_count AS DECIMAL(10, 4)))
FROM
(SELECT
id,
COUNT(DISTINCT assign) AS assign_count
FROM
X
GROUP BY
id) Assign_Counts
.
SELECT
team,
AVG(CAST(assign_count AS DECIMAL(10, 4)))
FROM
(SELECT
id,
team,
COUNT(DISTINCT assign) AS assign_count
FROM
X
GROUP BY
id,
team) Assign_Counts
GROUP BY
Team
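Both nested aggregations can be checked against the sample rows with the sketch below, which uses Python's sqlite3 module. SQLite's AVG always returns a floating-point value, so the CAST to DECIMAL used above is not needed here; other engines may still require it to avoid integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (id INTEGER, assign TEXT, team TEXT)")
conn.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [
        (1, "hunkim", "A"), (1, "ygg", "A"),
        (2, "hun", "B"), (2, "gw", "B"), (2, "david", "B"),
        (3, "haha", "A"),
    ],
)

# Average of the per-id distinct assign counts across all ids.
overall_avg = conn.execute(
    """
    SELECT AVG(assign_count)
    FROM (
        SELECT id, COUNT(DISTINCT assign) AS assign_count
        FROM X
        GROUP BY id
    ) AS assign_counts
    """
).fetchone()[0]

# Same inner counts, but kept per team and averaged per team.
team_avgs = conn.execute(
    """
    SELECT team, AVG(assign_count) AS assign_avg
    FROM (
        SELECT id, team, COUNT(DISTINCT assign) AS assign_count
        FROM X
        GROUP BY id, team
    ) AS assign_counts
    GROUP BY team
    ORDER BY team
    """
).fetchall()

print(overall_avg)  # 2.0
print(team_avgs)    # [('A', 1.5), ('B', 3.0)]
```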
What you want can be done in one query, using aggregate functions COUNT and AVG:
SELECT t.id,
COUNT(*) AS num_instances,
AVG(t.id) AS assign_avg
FROM TABLE t
GROUP BY t.id
Columns that do not have an aggregate function performed on them need to be defined in the GROUP BY clause.