Combine two rows for the same month, each containing null values, into one row - sql

I would like to have a dataframe where 1 row only contains one month of data.
month cust_id closed_deals cum_closed_deals checkout cum_checkout
2019-10-01 1 15 15 null null
2019-10-01 1 null 15 210 210
2019-11-01 1 27 42 null 210
2019-11-01 1 null 42 369 579
Expected result:
month cust_id closed_deals cum_closed_deals checkout cum_checkout
2019-10-01 1 15 15 210 210
2019-11-01 1 27 42 369 579
At first, I thought a normal GROUP BY would work, but when I try to group by only "month" and "cust_id", I get an error saying that closed_deals and checkout also need to be in the GROUP BY.

You can simply aggregate by the (first of the) month and cust_id and take the MAX of all other columns. Aggregate functions such as MAX ignore NULLs, so the NULL half of each pair of rows drops out:
SELECT
month,
cust_id,
MAX(closed_deals) AS closed_deals,
MAX(cum_closed_deals) AS cum_closed_deals,
MAX(checkout) AS checkout,
MAX(cum_checkout) AS cum_checkout
FROM yourTable
GROUP BY
month,
cust_id;
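As a quick check of the aggregation above, here is a minimal sketch that runs the same query against the question's four sample rows in an in-memory SQLite database. The table name deals_raw and the use of SQLite are assumptions made for illustration; the original question does not say which engine it targets.

```python
import sqlite3

# In-memory copy of the question's sample rows (table name is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE deals_raw (
    month TEXT, cust_id INTEGER,
    closed_deals INTEGER, cum_closed_deals INTEGER,
    checkout INTEGER, cum_checkout INTEGER)""")
conn.executemany(
    "INSERT INTO deals_raw VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2019-10-01", 1, 15, 15, None, None),
        ("2019-10-01", 1, None, 15, 210, 210),
        ("2019-11-01", 1, 27, 42, None, 210),
        ("2019-11-01", 1, None, 42, 369, 579),
    ],
)

# MAX skips NULLs, so every column collapses to its largest non-NULL value per group.
merged = conn.execute("""
    SELECT month, cust_id,
           MAX(closed_deals), MAX(cum_closed_deals),
           MAX(checkout), MAX(cum_checkout)
    FROM deals_raw
    GROUP BY month, cust_id
    ORDER BY month
""").fetchall()

for row in merged:
    print(row)
```

Note that taking the MAX of the cumulative columns is only safe because cumulative totals never decrease within a month; for columns that can go down, a different rule (for example, the value from the latest row) would be needed.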

Related

How do you get the last entry for each month in SQL?

I am looking to filter very large tables to the latest entry per user per month. I'm not sure if I found the best way to do this. I know I "should" trust the SQL engine (Snowflake), but part of me does not like the join on three columns.
Note that this is a very common operation on many big tables, and I want to use it in DBT views which means it will get run all the time.
To illustrate, my data is of this form:
mytable
userId  loginDate   year  month  value
1       2021-01-04  2021  1      41.1
1       2021-01-06  2021  1      411.1
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-06  2021  2      32
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
And I'm trying to use SQL to get the last value (by loginDate) for each month.
I'm currently doing a groupby & a join as follows:
WITH latest_entry_by_month AS (
    SELECT "userId", "year", "month", max("loginDate") AS "loginDate"
    FROM mytable
    GROUP BY "userId", "year", "month" -- required: one max("loginDate") per user per month
)
SELECT * FROM mytable NATURAL JOIN latest_entry_by_month
The above results in my desired output:
userId  loginDate   year  month  value
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
But I'm not sure if it's optimal.
Any guidance on how to do this faster? Note that I am not materializing the underlying data, so it is effectively un-clustered (I'm getting it from a vendor via the Snowflake marketplace).
Use QUALIFY with a window function (ROW_NUMBER) to keep only the newest row in each (userId, year, month) partition:
SELECT *
FROM mytable
QUALIFY ROW_NUMBER() OVER(PARTITION BY userId, year, month
ORDER BY loginDate DESC) = 1
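QUALIFY is not available in every engine. As a rough, portable sketch of the same idea, the ROW_NUMBER can be computed in a subquery and filtered in the outer WHERE. The example below assumes SQLite 3.25 or later (for window functions) and loads the question's sample rows into an in-memory table; it says nothing about Snowflake performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (userId INTEGER, loginDate TEXT, year INTEGER, month INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2021-01-04", 2021, 1, 41.1),
        (1, "2021-01-06", 2021, 1, 411.1),
        (1, "2021-01-25", 2021, 1, 251.1),
        (2, "2021-01-05", 2021, 1, 4369),
        (2, "2021-02-06", 2021, 2, 32),
        (2, "2021-02-14", 2021, 2, 731),
        (3, "2021-01-20", 2021, 1, 258),
        (3, "2021-02-19", 2021, 2, 4251),
        (3, "2021-03-15", 2021, 3, 171),
    ],
)

# Number the rows of each (userId, year, month) group from newest to oldest,
# then keep only row number 1. This is what QUALIFY ... = 1 does in one step.
latest = conn.execute("""
    SELECT userId, loginDate, year, month, value
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY userId, year, month
                   ORDER BY loginDate DESC
               ) AS rn
        FROM mytable
    )
    WHERE rn = 1
    ORDER BY userId, loginDate
""").fetchall()

for row in latest:
    print(row)
```

Unlike the join on max(loginDate), ROW_NUMBER returns exactly one row per group even when two rows share the same latest loginDate.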

How do I select / identify a row based on criteria in a different row in SQL

I've never posted on here before, but I am really stumped on this and looking for any assistance I can get! I am not the best SQL code writer and I do not understand every concept, but I am a quick learner. I am not sure this is the best way to accomplish my goal, and if there is a more efficient way to do this, I would be open to learning. I appreciate any help that can be provided.
Task:
I am attempting to write a SQL code that will help me place a number under the "Grab" column that allows me to exclude other rows out that are not needed.
Issue:
Pricing has a timeframe when it is applicable. The [PriceBookTable] captures the time frame range for each price book that is listed. However, as time goes on, some price books become outdated and do not need to be reviewed.
Based on today's date, I am trying to identify the previous version price book as well as the next version (if there is one).
Table Used: [PriceBookTable]
ID   Description     CategoryID  ParentID  StartDate   EndDate
412  56 MSRP         56          NULL      NULL        NULL
413  3 MSRP          3           NULL      NULL        NULL
414  61 MSRP         61          NULL      NULL        NULL
415  63 MSRP         63          NULL      NULL        NULL
419  58 MSRP         58          NULL      NULL        NULL
420  62 MSRP         62          NULL      NULL        NULL
430  67 MSRP         67          NULL      NULL        NULL
431  68 MSRP         68          NULL      NULL        NULL
505  2020 Version 1  56          412       2020-08-31  2020-12-31
537  2021 Version 1  56          412       2021-01-01  2021-03-31
586  2021 Version 2  56          412       2021-04-01  2021-04-13
622  2021 Version 3  56          412       2021-04-14  2021-07-31
688  2021 Version 4  56          412       2021-08-01  2021-12-31
Current Code:
USE [Database]
DECLARE @PriceBookID AS VARCHAR(10)
SET @PriceBookID = '412' --Parent Price Book ID
SELECT A.*,
       [Grab] = CASE WHEN A.ParentID IS NULL AND A.StartDate IS NULL AND A.EndDate IS NULL THEN 1 -- Always needs to be #1
                     WHEN CAST(GETDATE() AS DATE) BETWEEN A.StartDate AND A.EndDate THEN 3 --Currently Active Price Book based on Today's Date
                     ELSE NULL END
FROM( SELECT ID,
             ParentID,
             [PriceBookDescription] = Description,
             StartDate,
             EndDate,
             [ActivePriceBook] = CASE WHEN CAST(GETDATE() AS DATE) BETWEEN StartDate AND EndDate THEN 'Active' ELSE NULL END,
             [PBOrder] = ROW_NUMBER() OVER (ORDER BY ID ASC)
      FROM [PriceBookTable]
      WHERE 1=1 AND ID IN (@PriceBookID) OR ParentID IN (@PriceBookID)) A
Current Output:
ID   ParentID  PriceBookDescription  StartDate   EndDate     ActivePriceBook  PBOrder  Grab
412  Null      MSRP                  NULL        NULL        NULL             1        1
505  412       2020 Version 1        2020-08-31  2020-12-31  NULL             2        NULL
537  412       2021 Version 1        2021-01-01  2021-03-31  NULL             3        NULL
586  412       2021 Version 2        2021-04-01  2021-04-13  NULL             4        NULL
622  412       2021 Version 3        2021-04-14  2021-07-31  Active           5        3
688  412       2021 Version 4        2021-08-01  2021-12-31  NULL             6        NULL
Notes:
I was originally hoping that the "PBOrder" column would be useful, but as time goes on the list grows as more price books are created and, for example, row #4 [ID 586] will not always be relevant.
I would have just added a "WHERE ID IN ('412','586','622','688')" clause, but the IDs change based on different categories (not shown). So I am stuck with the date range.
Desired Output:
ID   ParentID  PriceBookDescription  StartDate   EndDate     ActivePriceBook  PBOrder  Grab
412  Null      MSRP                  NULL        NULL        NULL             1        1
586  412       2021 Version 2        2021-04-01  2021-04-13  NULL             4        2
622  412       2021 Version 3        2021-04-14  2021-07-31  Active           5        3
688  412       2021 Version 4        2021-08-01  2021-12-31  NULL             6        4
I hope this makes sense and please let me know if you have any questions regarding this.
Thank you again for any help!
It took me a while to understand what you wanted, but after figuring it out I was able to address what you need. Basically, you want:
1. To identify a single active record within a category based on the current date.
2. Then get the adjacent inactive records, with respect to time, that share a parent record.
3. Then get the record for the parent category and include it in the result set.
The 'pbOrder' and 'grab' columns seem to be intermediate steps toward this goal. You don't need them in the output.
If this is all correct, then you can delegate your identification of an active record to a cross apply calculation, and then use lead and lag in addition to the raw result to identify the active record as well as the adjacent ones in time.
declare @PriceBookID int = 412; -- why varchar, I would use int
with rowsToGrab as (
    select pbt.*,
           ap.activePriceBook,
           grab =
               case
                   when pbt.ParentID is null then 1 -- parent record
                   when lead(ap.ActivePriceBook) over(order by pbt.startDate) is not null then 1 -- row just before the active one
                   when lag(ap.ActivePriceBook) over(order by pbt.startDate) is not null then 1 -- row just after the active one
                   when ap.ActivePriceBook is not null then 1 -- the active row itself
               end
    from [PriceBookTable] pbt -- the table named in the question
    cross apply (select ActivePriceBook =
                     case
                         when cast(getdate() as date) between startdate and enddate then 'Active'
                     end
                ) ap
    where @PriceBookID in (ID, ParentID)
)
select id, ParentID, description as PriceBookDescription, StartDate, EndDate, ActivePriceBook
from rowsToGrab
where grab is not null
order by id, StartDate
This produces:
id   ParentID  PriceBookDescription  StartDate   EndDate     ActivePriceBook
412  NULL      56 MSRP               NULL        NULL        NULL
586  412       2021 Version 2        2021-04-01  2021-04-13  NULL
622  412       2021 Version 3        2021-04-14  2021-07-31  Active
688  412       2021 Version 4        2021-08-01  2021-12-31  NULL
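The LEAD/LAG idea can be tried outside SQL Server as well. The following is only a sketch under assumptions: it uses SQLite 3.25+ (no CROSS APPLY or GETDATE, so the active flag is computed in a CTE and "today" is passed in as a parameter), loads a subset of the question's rows, and the function name books_to_review is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE PriceBookTable (
    ID INTEGER, Description TEXT, CategoryID INTEGER,
    ParentID INTEGER, StartDate TEXT, EndDate TEXT)""")
conn.executemany(
    "INSERT INTO PriceBookTable VALUES (?, ?, ?, ?, ?, ?)",
    [
        (412, "56 MSRP", 56, None, None, None),
        (413, "3 MSRP", 3, None, None, None),
        (505, "2020 Version 1", 56, 412, "2020-08-31", "2020-12-31"),
        (537, "2021 Version 1", 56, 412, "2021-01-01", "2021-03-31"),
        (586, "2021 Version 2", 56, 412, "2021-04-01", "2021-04-13"),
        (622, "2021 Version 3", 56, 412, "2021-04-14", "2021-07-31"),
        (688, "2021 Version 4", 56, 412, "2021-08-01", "2021-12-31"),
    ],
)


def books_to_review(parent_id, today):
    """Return IDs of the parent, the active price book, and its two neighbours in time."""
    rows = conn.execute(
        """
        WITH flagged AS (
            -- 1 when `today` falls inside the row's date range, else 0
            SELECT *,
                   CASE WHEN :today BETWEEN StartDate AND EndDate THEN 1 ELSE 0 END AS is_active
            FROM PriceBookTable
            WHERE :pid IN (ID, ParentID)
        ),
        neighbours AS (
            -- look one row back and one row ahead in StartDate order
            SELECT *,
                   LAG(is_active, 1, 0) OVER (ORDER BY StartDate) AS prev_active,
                   LEAD(is_active, 1, 0) OVER (ORDER BY StartDate) AS next_active
            FROM flagged
        )
        SELECT ID
        FROM neighbours
        WHERE ParentID IS NULL OR is_active = 1 OR prev_active = 1 OR next_active = 1
        ORDER BY ID
        """,
        {"today": today, "pid": parent_id},
    ).fetchall()
    return [r[0] for r in rows]


print(books_to_review(412, "2021-05-01"))
```

Passing the date as a parameter instead of calling the current-date function makes the query easy to test against a fixed day such as 2021-05-01, which falls inside "2021 Version 3".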

SQL how to count but only count one instance if two columns match?

Wondering how to select from a table:
FIELDID personID purchaseID dateofPurchase
--------------------------------------------------
2 13 147 2014-03-21 00:00:00
3 15 165 2015-03-23 00:00:00
4 13 456 2018-03-24 00:00:00
5 1 133 2018-03-21 00:00:00
6 23 123 2013-03-22 00:00:00
7 25 456 2013-03-21 00:00:00
8 25 456 2013-03-23 00:00:00
9 22 456 2013-03-28 00:00:00
10 25 589 2013-03-21 00:00:00
11 82 147 1991-10-22 00:00:00
12 82 453 2003-03-22 00:00:00
I'd like to get a result table of two columns: weekday and the number of purchases of each weekday, but only count the distinct days of purchases if done by the same person on the same day - for example since personID 25 purchased two things on 2013-03-21, that should only count as one 'thursday' instead of 2.
Basically, if the personID and the dateofPurchase are the same for more than one row, only count it once is what I want.
Here is what I have currently: It does everything correctly except it will count the above scenario under the thursday twice, when I would only want to add one:
SELECT v.wkday as day, COUNT(*) as 'absences'
FROM dbo.AttendanceRecord pr CROSS APPLY
(VALUES (CASE WHEN DATEPART(WEEKDAY, date) IN (1, 7)
THEN 'Weekend'
ELSE DATENAME(WEEKDAY, date)
END)
) v(wkday)
GROUP BY v.wkday;
to clarify:
If a person purchases at least one purchaseID on a specific day, they are counted as having purchased on that day and do not need to be counted again for each additional purchaseID on that day.
I think you want to count distinct persons, so that would be:
COUNT(DISTINCT personid) as absences
Note that single quotes are not appropriate around column aliases. If you need to escape them, use square brackets.
EDIT:
If you want to count distinct person-days, then you can use:
COUNT(DISTINCT CONCAT(personid, ':', dateofpurchase)) as absences
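To see the person-day count in action, here is a small sketch using SQLite rather than SQL Server, which is an assumption of this example. It therefore uses the || operator instead of CONCAT and strftime('%w', ...) (0 = Sunday through 6 = Saturday) instead of DATENAME. The table name purchases and the reduced set of rows are also made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (personID INTEGER, purchaseID INTEGER, dateofPurchase TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (25, 456, "2013-03-21"),
        (25, 589, "2013-03-21"),  # same person, same day: should count once
        (25, 456, "2013-03-23"),
        (22, 456, "2013-03-28"),
        (23, 123, "2013-03-22"),
    ],
)

# Build a "person:date" key and count distinct keys per weekday number.
counts = dict(
    conn.execute("""
        SELECT strftime('%w', dateofPurchase) AS wkday,
               COUNT(DISTINCT personID || ':' || dateofPurchase) AS person_days
        FROM purchases
        GROUP BY wkday
    """).fetchall()
)

print(counts)
```

Here weekday '4' is Thursday: person 25 on 2013-03-21 is counted once despite two purchases, and person 22 on 2013-03-28 adds the second Thursday person-day.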

Creating a new calculated column in SQL

Is there a way to count how often each number of UDs per day occurs? In the data below, June 24 appears twice (two UDs), while every other date appears only once.
I am showing the expected output here:
Primary key UD Date
-------------------------------------------
1 123 2015-06-24 00:00:00.000
6 456 2015-06-24 00:00:00.000
2 123 2015-06-25 00:00:00.000
3 658 2015-06-26 00:00:00.000
4 598 2015-06-27 00:00:00.000
5 156 2015-06-28 00:00:00.000
No of times Number of days
-----------------------------
4 1
2 2
The logic is that 4 users used the application on 1 day and 2 users used the application on 2 days.
You can use two levels of aggregation:
select cnt, count(*)
from (select date, count(*) as cnt
from t
group by date
) d
group by cnt
order by cnt desc;
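A minimal sketch of the two-level aggregation, run against the question's six rows in an in-memory SQLite database. The column names pk and ud and the alias num_dates are stand-ins chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER, ud INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, 123, "2015-06-24"),
        (6, 456, "2015-06-24"),
        (2, 123, "2015-06-25"),
        (3, 658, "2015-06-26"),
        (4, 598, "2015-06-27"),
        (5, 156, "2015-06-28"),
    ],
)

# Inner query: rows per date. Outer query: how many dates share each row count.
histogram = conn.execute("""
    SELECT cnt, COUNT(*) AS num_dates
    FROM (SELECT date, COUNT(*) AS cnt
          FROM t
          GROUP BY date) d
    GROUP BY cnt
    ORDER BY cnt DESC
""").fetchall()

print(histogram)
```

On this data the result is one date with two rows (June 24) and four dates with a single row each.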

How to calculate a running total that is a distinct sum of values

Consider this dataset:
id site_id type_id value date
------- ------- ------- ------- -------------------
1 1 1 50 2017-08-09 06:49:47
2 1 2 48 2017-08-10 08:19:49
3 1 1 52 2017-08-11 06:15:00
4 1 1 45 2017-08-12 10:39:47
5 1 2 40 2017-08-14 10:33:00
6 2 1 30 2017-08-09 07:25:32
7 2 2 32 2017-08-12 04:11:05
8 3 1 80 2017-08-09 19:55:12
9 3 2 75 2017-08-13 02:54:47
10 2 1 25 2017-08-15 10:00:05
I would like to construct a query that returns a running total for each date by type. I can get close with a window function, but I only want the latest value for each site to be summed for the running total (a simple window function will not work because it sums all values up to a date--not just the last values for each site). So I guess it could be better described as a running distinct total?
The result I'm looking for would be like this:
type_id date sum
------- ------------------- -------
1 2017-08-09 06:49:47 50
1 2017-08-09 07:25:32 80
1 2017-08-09 19:55:12 160
1 2017-08-11 06:15:00 162
1 2017-08-12 10:39:47 155
1 2017-08-15 10:00:05 150
2 2017-08-10 08:19:49 48
2 2017-08-12 04:11:05 80
2 2017-08-13 02:54:47 155
2 2017-08-14 10:33:00 147
The key here is that the sum is not a running sum. It should only be the sum of the most recent values for each site, by type, at each date. I think I can help explain it by walking through the result set I've provided above. For my explanation, I'll walk through the original data chronologically and try to explain the expected result.
The first row of the result starts us off, at 2017-08-09 06:49:47, where chronologically, there is only one record of type 1 and it is 50, so that is our sum for 2017-08-09 06:49:47.
The second row of the result is at 2017-08-09 07:25:32, at this point in time we have 2 unique sites with values for type_id = 1. They have values of 50 and 30, so the sum is 80.
The third row of the result occurs at 2017-08-09 19:55:12, where now we have 3 sites with values for type_id = 1. 50 + 30 + 80 = 160.
The fourth row is where it gets interesting. At 2017-08-11 06:15:00 there are 4 records with a type_id = 1, but 2 of them are for the same site. I'm only interested in the most recent value for each site so the values I'd like to sum are: 30 + 80 + 52 resulting in 162.
The 5th row is similar to the 4th since the value for site_id:1, type_id:1 has changed again and is now 45. This results in the latest values for type_id:1 at 2017-08-12 10:39:47 are now: 30 + 80 + 45 = 155.
Reviewing the 6th row is also interesting when we consider that at 2017-08-15 10:00:05, site 2 has a new value for type_id 1, which gives us: 80 + 45 + 25 = 150 for 2017-08-15 10:00:05.
You can get a cumulative total (running total) by including an ORDER BY clause in your window frame.
select
type_id,
date,
sum(value) over (partition by type_id order by date) as sum
from your_table;
The ORDER BY works because
The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
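SELECT type_id,
       date,
       -- Running sum of per-site changes: each new reading replaces that site's previous one
       SUM(delta) OVER (PARTITION BY type_id ORDER BY date) AS sum
FROM (SELECT type_id,
             date,
             value - COALESCE(LAG(value) OVER (PARTITION BY type_id, site_id ORDER BY date), 0) AS delta
      FROM your_table
     ) d
ORDER BY type_id,
         date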
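As a cross-check of the expected result, the definition in the question (for each row, sum every site's most recent value of that type as of the row's timestamp) can be evaluated directly with a correlated subquery. This is a brute-force sketch in SQLite, with the table name readings assumed for the example; it is meant for verifying results on small data, not as an efficient query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER, site_id INTEGER, type_id INTEGER, value INTEGER, date TEXT)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 50, "2017-08-09 06:49:47"),
        (2, 1, 2, 48, "2017-08-10 08:19:49"),
        (3, 1, 1, 52, "2017-08-11 06:15:00"),
        (4, 1, 1, 45, "2017-08-12 10:39:47"),
        (5, 1, 2, 40, "2017-08-14 10:33:00"),
        (6, 2, 1, 30, "2017-08-09 07:25:32"),
        (7, 2, 2, 32, "2017-08-12 04:11:05"),
        (8, 3, 1, 80, "2017-08-09 19:55:12"),
        (9, 3, 2, 75, "2017-08-13 02:54:47"),
        (10, 2, 1, 25, "2017-08-15 10:00:05"),
    ],
)

# For each row r: sum the readings s of the same type taken at or before r.date,
# skipping any s that a newer reading from the same site has already replaced.
totals = conn.execute("""
    SELECT r.type_id,
           r.date,
           (SELECT SUM(s.value)
            FROM readings s
            WHERE s.type_id = r.type_id
              AND s.date <= r.date
              AND NOT EXISTS (SELECT 1
                              FROM readings newer
                              WHERE newer.type_id = s.type_id
                                AND newer.site_id = s.site_id
                                AND newer.date > s.date
                                AND newer.date <= r.date)) AS total
    FROM readings r
    ORDER BY r.type_id, r.date
""").fetchall()

for row in totals:
    print(row)
```

The totals match the expected table in the question, including the last type 1 row (80 + 45 + 25 = 150), where sites 1 and 2 have both been updated.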