Calculate distinct totals over time - SQL

I have the following data:
UniqueID SenderID EntryID Date
1 1 1 2015-09-17
2 1 1 2015-09-23
3 2 1 2015-09-17
4 2 1 2015-09-17
5 3 1 2015-09-17
6 4 1 2015-09-19
7 3 1 2015-09-20
What I require is the following:
3 2015-09-17
4 2015-09-19
4 2015-09-20
4 2015-09-23
Where the first column is the total of unique Sender/Entry combinations up to that date. So, for example, the entry on 2015-09-23 for Sender 1 and Entry 1 does not increase the total because there is already a matching entry from 2015-09-17.
How can I do this efficiently, ideally without joining the table to itself? That approach ends up as a very large query which is not practical. I have done something similar in Postgres with OVER(), but unfortunately that isn't available in this setup.
I could also do this in code - which I have done - but then the calculation happens outside of the db system and the results have to be imported back in. With millions of rows this process takes days, and I ideally only have hours.

OVER is ANSI standard functionality available in most databases. What you are counting are starts for users, and you can readily do this with a cumulative sum:
select startdate,
       sum(count(*)) over (order by startdate) as CumulativeUniqueCount
from (select senderid, min(date) as startdate
      from table t
      group by senderid
     ) t
group by startdate
order by startdate;
This should work in any database that supports window functions, such as Oracle, SQL Server 2012+, Postgres, Teradata, DB2, Hive, Redshift, to name a few.
EDIT:
You need a left join to get all the dates in the data:
select d.date,
       sum(count(t.senderid)) over (order by d.date) as CumulativeUniqueCount
from (select distinct date from table t) d left join
     (select senderid, min(date) as startdate
      from table t
      group by senderid
     ) t
     on t.startdate = d.date
group by d.date
order by d.date;

Credit to Gordon Linoff for the basic query. However, it will not return rows for dates that don't increase the cumulative sum.
To get those extra rows, you need to include an additional subquery that lists all the distinct dates from the table, and then left join it with Gordon's query plus a few minor tweaks to get the desired result:
select d.SomeDate,
       sum(count(t.SenderId)) over (order by d.SomeDate)
from (select distinct SomeDate
      from SomeTable) d
     left join
     (select SenderId, min(somedate) as MinDate
      from SomeTable
      group by SenderId) t
     on d.SomeDate = t.MinDate
group by d.SomeDate
order by d.SomeDate;
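For completeness, since the question says OVER() is not available in its setup, a correlated-subquery version with no window functions should give the same running totals - a sketch only, assuming the table is called entries; note it rescans the table once per distinct date, so it may be slow on millions of rows:
select d.date,
       (select count(distinct s.senderid)      -- senders seen on or before this date
        from entries s
        where s.date <= d.date) as cumulative_unique_count
from (select distinct date from entries) d     -- one row per date present in the data
order by d.date;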

Checking conditions per group, and ranking most recent row?

I'm handling a table like so:
Name    Status  Date
Alfred  1       Jan 1 2023
Alfred  2       Jan 2 2023
Alfred  3       Jan 2 2023
Alfred  4       Jan 3 2023
Bob     1       Jan 1 2023
Bob     3       Jan 2 2023
Carl    1       Jan 5 2023
Dan     1       Jan 8 2023
Dan     2       Jan 9 2023
I'm trying to set up a query so I can handle the following:
I'd like to pull the most recent status per Name:
SELECT MAX(Date), Status, Name
FROM test_table
GROUP BY Status, Name
Additionally, I'd like to be able to pull in the same query whether the user has ever had a status of 2, regardless of whether the most recent one is 2 or not:
WITH has_2_table AS (
SELECT DISTINCT Name, TRUE as has_2
FROM test_table
WHERE Status = 2 )
And then maybe joining the above with a left join on Name?
But having these as two separate queries and joining them feels clunky to me, especially since I'd like to add additional columns and other checks. Is there a better way to set this up in one single query, or is this the most efficient way?
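Spelled out, the two-part version described above would look roughly like this - a sketch only, pinning the latest row per Name with ROW_NUMBER and left joining the has_2 CTE:
WITH has_2_table AS (
    SELECT DISTINCT Name, TRUE AS has_2
    FROM test_table
    WHERE Status = 2
), latest AS (
    SELECT Name, Status, Date,
           ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Date DESC, Status DESC) AS rn
    FROM test_table
)
SELECT l.Name, l.Status, l.Date,
       COALESCE(h.has_2, FALSE) AS has_2   -- FALSE when the name never had status 2
FROM latest l
LEFT JOIN has_2_table h ON h.Name = l.Name
WHERE l.rn = 1;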
You said, "I'd like to add additional columns" so I interpret that to mean you would like to Select the entire most recent record and add an 'ever-2' column.
You can either do this by joining two queries, or use window functions. Not knowing Snowflake Cloud Data, I cannot tell you which is more efficient.
Join 2 Queries
Select A.*, Coalesce(B.Ever2, 'No') as Ever2
From (
    Select * From test_table x
    Where date = (Select max(date) From test_table y
                  Where x.name = y.name)
) A Left Outer Join (
    Select name, 'Yes' as Ever2 From test_table
    Where status = 2
    Group By name
) B On A.name = B.name
The first subquery can also be written as an Inner Join if correlated subqueries are implemented badly on your platform.
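A sketch of that Inner Join rewrite, keeping the same test_table and single-quoted string literals:
Select A.*, Coalesce(B.Ever2, 'No') as Ever2
From (
    Select x.*
    From test_table x
    Inner Join (
        Select name, max(date) as maxdate   -- latest date per name
        From test_table
        Group By name
    ) m On x.name = m.name And x.date = m.maxdate
) A Left Outer Join (
    Select name, 'Yes' as Ever2             -- names that ever had status 2
    From test_table
    Where status = 2
    Group By name
) B On A.name = B.name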
Use of Window Functions
Select * From (
    Select row_number() Over (Partition by name Order by date desc, status desc) as bestrow,
           A.*,
           Coalesce(max(Case When status = 2 Then 'Yes' End) Over (Partition By name), 'No') as Ever2
    From test_table A
)
Where bestrow = 1
This second query type always reads and sorts the entire test_table so it might not be the most efficient.
Given that you have a different partitioning on the two aggregations, you could try going with window functions instead:
SELECT DISTINCT Name,
MAX(Date) OVER(
PARTITION BY Name, Status
) AS lastdate,
MAX(CASE WHEN Status = 2 THEN 1 ELSE 0 END) OVER(
PARTITION BY Name
) AS status2
FROM tab
I'd like to pull the most recent status per name […] Additionally I'd like in the same query to be able to pull if the user has ever had a status of 2.
Snowflake has sophisticated aggregate functions.
Using group by, we can get the latest status with arrays and check for a given status with boolean aggregation:
select name, max(date) max_date,
get(array_agg(status) within group (order by date desc), 0) last_status,
boolor_agg(status = 2) has_status2
from mytable
group by name
We could also use window functions and qualify:
select name, date as max_date,
       status as last_status,
       boolor_agg(status = 2) over(partition by name) has_status2
from mytable
qualify rank() over(partition by name order by date desc) = 1

Grouping over multiple columns and counting distinct over different groups

Given this data
month  id
1      x
1      x
1      y
2      z
2      x
2      y
My output should be
month  distinct_id  total_id
1      2            3
2      3            3
How can I achieve this in a single query?
I tried this query
SELECT TO_CHAR(DOCDATE,'MON') MON
,COUNT(DISTINCT T.MOB_MTCHED_LYLTY_ID) OVER() SHARE
from data
group by 1
but this is giving me an error
select month,
count(distinct id) distinct_id,
count(id) total_id
from data
group by month;
SELECT [Month], COUNT(DISTINCT id) as dist_id, COUNT(id) as count_id
FROM data
GROUP BY Month
Also, I should say:
About your code - don't use OVER() if it's not necessary.
Don't use pictures in your question like you did here - providing the data in a small table is better.

Postgres/SQL subquery - return multiples columns per grouping based on condition

Struggling with this subquery - it should be basic, but I'm missing something. I need to make these available as part of a larger query.
I have customers, and I want to get the ONE transaction with the HIGHEST timestamp.
Customer
customer foo
1 val1
2 val2
Transaction
tx_key customer timestamp value
1 1 11/22 10
2 1 11/23 15
3 2 11/24 20
4 2 11/25 25
The desired output of the query:
customer foo timestamp value
1 val1 11/23 15
2 val2 11/25 25
I successfully calculated what I needed by using multiple subqueries, but it is very slow when I have a larger data set.
I did it like this:
(select timestamp from transaction where transaction.customer = customer.customer order by timestamp desc limit 1) as tx_timestamp,
(select value from transaction where transaction.customer = customer.customer order by timestamp desc limit 1) as tx_value
So how do I reduce this down to only calculating it once? In my real data set, I have 15 columns joined over 100k rows, so doing this over and over is not performant enough.
In Postgres, the simplest method is distinct on:
select distinct on (customer) c.*, t.timestamp, t.value
from transaction t join
     customer c
     using (customer)
order by customer, timestamp desc;
Try this query please:
SELECT
    T.customer, C.foo, T.timestamp, T.value
FROM Transaction T
JOIN Customer C ON C.customer = T.customer
JOIN
    (SELECT
         customer, max(timestamp) as timestamp
     FROM Transaction GROUP BY customer) MT ON
    T.customer = MT.customer
    AND T.timestamp = MT.timestamp
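If the larger query needs several columns from the latest transaction, a LATERAL join is another Postgres option worth trying - a sketch, using the table and column names from the question:
select c.customer, c.foo, t.timestamp, t.value
from customer c
left join lateral (
    select tx.timestamp, tx.value        -- latest transaction for this customer
    from transaction tx
    where tx.customer = c.customer
    order by tx.timestamp desc
    limit 1
) t on true;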

Finding the first occurrence of an element in a SQL database

I have a table with a column for customer names, a column for purchase amount, and a column for the date of the purchase. Is there an easy way I can find how much first time customers spent on each day?
So I have
Name | Purchase Amount | Date
Joe 10 9/1/2014
Tom 27 9/1/2014
Dave 36 9/1/2014
Tom 7 9/2/2014
Diane 10 9/3/2014
Larry 12 9/3/2014
Dave 14 9/5/2014
Jerry 16 9/6/2014
And I would like something like
Date | Total first Time Purchase
9/1/2014 73
9/3/2014 22
9/6/2014 16
Can anyone help me out with this?
The following is standard SQL and works on nearly all DBMS
select date,
sum(purchaseamount) as total_first_time_purchase
from (
select date,
purchaseamount,
row_number() over (partition by name order by date) as rn
from the_table
) t
where rn = 1
group by date;
The derived table (the inner select) selects all "first time" purchases, and the outer query aggregates them by date.
The two key concepts here are aggregates and sub-queries, and the details of which DBMS you're using may change the exact implementation, but the basic concept is the same:
1. For each name, determine their first date
2. Using the results of 1, find each person's first-day purchase amount
3. Using the results of 2, sum the amounts for each date
In SQL Server, it could look like this:
select Date, [totalFirstTimePurchases] = sum(PurchaseAmount)
from (
select t.Date, t.PurchaseAmount, t.Name
from table1 t
join (
select Name, [firstDate] = min(Date)
from table1
group by Name
) f on t.Name=f.Name and t.Date=f.firstDate
) ftp
group by Date
If you are using SQL Server you can accomplish this with either sub-queries or CTEs (Common Table Expressions). Since there is already an answer with sub-queries, here is the CTE version.
First, the following will identify each row where there is a first-time purchase, and then it gets the sum of those values grouped by date:
;WITH cte
AS (
SELECT [Name]
,PurchaseAmount
,[date]
,ROW_NUMBER() OVER (
PARTITION BY [Name] ORDER BY [date] --start at 1 for each name at the earliest date and count up, reset every time the name changes
) AS rn
FROM yourTableName
)
SELECT [date]
,sum(PurchaseAmount) AS TotalFirstTimePurchases
FROM cte
WHERE rn = 1
GROUP BY [date]

Multiple filters on SQL query

I have been reading many topics about filtering SQL queries, but none seems to apply to my case, so I'm in need of a bit of help. I have the following data in a SQL table.
Date                 item  quantity moved  quantity in stock  sequence
13-03-2012 16:51:00  xpto   2               2                 1
13-03-2012 16:51:00  xpto  -2               0                 2
21-03-2012 15:31:21  zyx    4               6                 1
21-03-2012 16:20:11  zyx    6              12                 2
22-03-2012 12:51:12  zyx   -3               9                 1
These are quantities moved in the warehouse. The problem is with the first two rows, which were a reception and a return at the same timestamp, because I'm trying to write a query that gives me the stock of all items at a given time. I use max(date), but I don't get the right quantity in the result.
SELECT item, qty_in_stock
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY item ORDER BY item_date DESC, sequence DESC) rn
FROM mytable
WHERE item_date <= #date_of_stock
) q
WHERE rn = 1
If you are on SQL Server 2012, there are several nice features added.
You can use the LAST_VALUE - or the FIRST_VALUE() - function, in combination with a ROWS or RANGE window frame (see OVER clause):
SELECT DISTINCT
item,
LAST_VALUE(quantity_in_stock) OVER (PARTITION BY item
ORDER BY date, sequence
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
AS quantity_in_stock
FROM tableX
WHERE date <= #date_of_stock
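The FIRST_VALUE() variant mentioned above simply reverses the ordering - a sketch with the same assumed column names:
SELECT DISTINCT
       item,
       FIRST_VALUE(quantity_in_stock) OVER (PARTITION BY item
                                            ORDER BY date DESC, sequence DESC)
         AS quantity_in_stock
FROM tableX
WHERE date <= #date_of_stock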
Add a where clause and do the summation:
select item, sum([quantity moved])
from t
where t.date <= #DESIREDDATETIME
group by item
If you put a plain date in for the desired datetime, remember that it is interpreted as midnight at the start of that day, so movements later on that day are excluded.
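A hedged example of covering the whole day rather than just its first instant; this assumes SQL Server's DATEADD and a @desired_date variable, so adjust for your DBMS:
select item, sum([quantity moved]) as qty_in_stock
from t
where t.date < dateadd(day, 1, @desired_date)   -- everything before the next midnight
group by item;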