Running total between two dates SQL - sql

I'm having trouble building an efficient query to get a running total of sales between two dates.
This is my current query:
select SalesId,
sum(Sales) as number_of_sales,
Sales_DATE as SalesDate,
ADD_MONTHS(Sales_DATE , -12) as SalesDatePrevYear
from DWH.L_SALES
group by SalesId, Sales_DATE
With the result:
| SalesId| number_of_sales| SalesDate|SalesDatePrevYear|
|:---- |:------:| :-----:|-----:|
| 1000| 1| 20200101|20190101|
| 1001| 1| 20220101|20210101|
| 1002| 1| 20220201|20210201|
| 1003| 1| 20220301|20210301|
The preferred result is the following:
| SalesId| number_of_sales| running total of sales | SalesDate|SalesDatePrevYear|
|:---- |:------:| :-----:| :-----:|-----:|
| 1000| 1| 1 | 20200101|20190101|
| 1001| 1| 1 | 20220101|20210101|
| 1002| 1| 2| 20220201|20210201|
| 1003| 1| 3|20220301|20210301|
As you can see, I want the total of Sales between the two dates, but because I also need the lower level (SalesId), it always stays at 1.
How can I get this efficiently?

You already have a result that gives you the start and end dates you care about, so you just need to join that result back to the original data with an inequality join and then sum the matching rows. I suggest writing it with CTEs (Common Table Expressions), which make a query easier to learn from and to debug.
For example,
WITH CTE_BASE_RESULT AS
(
your query goes here
)
SELECT CTE_BASE_RESULT.SalesId, CTE_BASE_RESULT.SalesDate, SUM(s.Sales) AS Total_Sales_Prior_Year
FROM CTE_BASE_RESULT
INNER JOIN DWH.L_SALES s
-- join on the date window only: matching on SalesId as well would pair each sale with just itself
ON s.Sales_DATE <= CTE_BASE_RESULT.SalesDate
AND s.Sales_DATE >= CTE_BASE_RESULT.SalesDatePrevYear
GROUP BY CTE_BASE_RESULT.SalesId, CTE_BASE_RESULT.SalesDate
I also recommend a website like SQL Generator that can help write complex operations, for example this is called Timeseries Aggregate.
This syntax works for Snowflake; I didn't see which system you're on.
Alternatively,
WITH BASIC_OFFSET_1YEAR AS (
SELECT
A.Sales_Id,
A.SalesDate,
SUM(B.Sales) as SUM_SALES_PAST1YEAR
FROM
L_Sales A
INNER JOIN L_Sales B
-- join on the date window only: matching on Sales_Id as well would pair each sale with just itself
ON B.SalesDate >= DATEADD(YEAR, -1, A.SalesDate)
AND B.SalesDate <= A.SalesDate
GROUP BY
A.Sales_Id,
A.SalesDate
)
SELECT
src.*, BASIC_OFFSET_1YEAR.SUM_SALES_PAST1YEAR
FROM
L_Sales src
LEFT OUTER JOIN BASIC_OFFSET_1YEAR
ON BASIC_OFFSET_1YEAR.SalesDate = src.SalesDate
AND BASIC_OFFSET_1YEAR.Sales_Id = src.Sales_Id
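As a rough check of the inequality-join idea, here is a minimal sketch that runs it against the four sample rows from the question. It uses Python's built-in sqlite3 module so it can be executed without a warehouse. The lowercase table and column names, the ISO date strings, and SQLite's date(x, '-12 months') (standing in for ADD_MONTHS or DATEADD) are assumptions for illustration, not the asker's actual schema. Note that the self-join matches on the date window only, not on the sale id.

```python
import sqlite3

# In-memory database with the four sample rows from the question.
# Table/column names and ISO date strings are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE l_sales (sales_id INTEGER, sales INTEGER, sales_date TEXT);
INSERT INTO l_sales VALUES
    (1000, 1, '2020-01-01'),
    (1001, 1, '2022-01-01'),
    (1002, 1, '2022-02-01'),
    (1003, 1, '2022-03-01');
""")

# For each sale (a), sum every sale (b) that falls in the 12 months
# ending on a's date. SQLite's date() modifier replaces ADD_MONTHS here.
running_totals = conn.execute("""
SELECT a.sales_id,
       a.sales_date,
       SUM(b.sales) AS running_total
FROM l_sales a
JOIN l_sales b
  ON b.sales_date <= a.sales_date
 AND b.sales_date >= date(a.sales_date, '-12 months')
GROUP BY a.sales_id, a.sales_date
ORDER BY a.sales_date
""").fetchall()

for row in running_totals:
    print(row)
```

On this sample the running totals come out as 1, 1, 2, 3, which matches the preferred result in the question. On a large table an inequality self-join can get expensive, so it is worth checking the query plan on your own system.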

Related

Customer life cycle status analysis based on monthly activity

Hi, my company wants to better track how many users are active on our platform. We are using Microsoft SQL Server 2019 as the database, accessed through Azure Data Studio.
Below are two tables DDLs from our DB:
CALENDAR TABLE
| COLUMN | DATA TYPE | DETAILS |
|:---- |:---- |:---- |
| CALENDAR_DATE | DATE NOT NULL | Base date (YYYY-MM-DD) |
| CALENDAR_YEAR | INTEGER NOT NULL | 2010, 2011 etc |
| CALENDAR_MONTH_NUMBER | INTEGER NOT NULL | 1-12 |
| CALENDAR_MONTH_NAME | VARCHAR(100) | January, February etc |
| CALENDAR_DAY_OF_MONTH | INTEGER NOT NULL | 1-31 |
| CALENDAR_DAY_OF_WEEK | INTEGER NOT NULL | 1-7 |
| CALENDAR_DAY_NAME | INTEGER NOT NULL | Monday, Tuesday etc |
| CALENDAR_YEAR_MONTH | INTEGER NOT NULL | 201011, 201012, 201101 etc |
REVENUE ANALYSIS
| COLUMN | DATA TYPE | DETAILS |
|:---- |:---- |:---- |
| ACTIVITY_DATE | DATE NOT NULL | Date wager was made |
| MEMBER_ID | INTEGER NOT NULL | Unique player identifier |
| GAME_ID | SMALLINT NOT NULL | Unique game identifier |
| WAGER_AMOUNT | REAL NOT NULL | Total amount wagered on the game |
| NUMBER_OF_WAGERS | INTEGER NOT NULL | Number of wagers on the game |
| WIN_AMOUNT | REAL NOT NULL | Total amount won on the game |
| ACTIVITY_YEAR_MONTH | INTEGER NOT NULL | YYYYMM |
| BANK_TYPE_ID | SMALLINT DEFAULT 0 NOT NULL | 0=Real money, 1=Bonus money |
Long story short "active" means that the member has made a minimum of one real money wager in the month.
Every month, a member has a certain lifecycle status. This status changes monthly, based on their activity in the previous and current months. The statuses are the following:
NEW: First time they placed a real money wager
RETAINED: Active in the prior calendar month and the current calendar month
UNRETAINED: Active in the prior calendar month but not active in the current calendar month
REACTIVATED: Not active in the prior calendar month, but active in the current calendar month
LAPSED: Not active in the prior calendar month or the current calendar month
We would like initially to get to a view with the columns below:
MEMBER_ID | CALENDAR_YEAR_MONTH | MEMBER_LIFECYCLE_STATUS | LAPSED_MONTHS
Also the view should display one row per member per month, starting from the month in which they first placed a real money wager. This view should give their lifecycle status for that month, and if the member has lapsed, it should show a rolling count of the number of months since they were last active.
So far I have come up with the following CTE to give me a basis for the view. However I am not sure about the UNRETAINED and REACTIVATED columns. Any ideas anyone?
with all_activities as (
select a.member_id, activity_date, calendar_month_number as month_activity, calendar_year as year_activity,
datepart(month,CURRENT_TIMESTAMP) as current_month, datepart(year,CURRENT_TIMESTAMP) as current_year,
datepart(month,CONVERT(DATE, DATEADD(DAY,-DAY(GETDATE()),GETDATE()))) as previous_month, datepart(year,CONVERT(DATE, DATEADD(DAY,-DAY(GETDATE()),GETDATE()))) as year_last_month,
a.NUMBER_OF_WAGERS, (case when datepart(month,CURRENT_TIMESTAMP) = calendar_month_number and datepart(year,CURRENT_TIMESTAMP) = calendar_year then 'active' else 'inactive' end) as status,
case when (case when datepart(month,CURRENT_TIMESTAMP) = calendar_month_number and datepart(year,CURRENT_TIMESTAMP) = calendar_year then 'active' else 'inactive' end) = 'active' and number_of_wagers = 1 then 'New'
when (LAG((case when datepart(month,CURRENT_TIMESTAMP) = calendar_month_number and datepart(year,CURRENT_TIMESTAMP) = calendar_year then 'active' else 'inactive' end) ,1,0) OVER(PARTITION BY member_id ORDER BY calendar_month_number desc) = 'active' and calendar_month_number = datepart(month,CONVERT(DATE, DATEADD(DAY,-DAY(GETDATE()),GETDATE())))) then 'Retained'
when (calendar_month_number = datepart(month,CURRENT_TIMESTAMP) and year_activity = datepart(year,CURRENT_TIMESTAMP) and calendar_month_number = datepart(month,CONVERT(DATE, DATEADD(DAY,-DAY(GETDATE()),GETDATE())))) then 'Unretained'
from [dbo].[REVENUE_ANALYSIS] a
join CALENDAR b on a.ACTIVITY_DATE= b.CALENDAR_DATE
)
select * from all_activities
This is a customer lifecycle status analysis, which requires a couple of things:
Customer acquisition date: it is nice to have this stored, because some customers may go back years or even decades. For this question, we assume revenue_analysis has everything we need to calculate the user acquisition month.
Lapsed vs. churned: a churned customer is usually defined as one with no activity for a set period of time. The question gives no such definition, so an inactive user is reported as lapsed indefinitely.
For the lifecycle status calculation, we gather the following columns (member_id, calendar_month, acquisition_month, activity_month, prior_activity_month), from which we can calculate the final result.
with cte_new_user_monthly as (
select member_id,
min(activity_year_month) as acquisition_month
from revenue_analysis
group by 1),
cte_user_monthly as (
select u.member_id,
u.acquisition_month,
m.yyyymm as calendar_month
from cte_new_user_monthly u,
calendar_month m
where u.acquisition_month <= m.yyyymm),
cte_user_activity_monthly as (
select f.member_id,
f.activity_year_month as activity_month
from revenue_analysis f
group by 1,2),
cte_user_lifecycle as (
select u.member_id,
u.calendar_month,
u.acquisition_month,
m.activity_month
from cte_user_monthly u
left join cte_user_activity_monthly m
on u.member_id = m.member_id
and u.calendar_month = m.activity_month),
cte_user_status as (
select member_id,
calendar_month,
acquisition_month,
activity_month,
lag(activity_month,1) over (partition by member_id order by calendar_month) as prior_activity_month
from cte_user_lifecycle),
user_status_monthly as (
select member_id,
calendar_month,
activity_month,
case
when calendar_month = acquisition_month then 'NEW'
when prior_activity_month is not null and activity_month is not null then 'RETAINED'
when prior_activity_month is not null and activity_month is null then 'UNRETAINED'
when prior_activity_month is null and activity_month is not null then 'REACTIVATED'
when prior_activity_month is null and activity_month is null then 'LAPSED'
else null
end as user_status
from cte_user_status)
select member_id,
calendar_month,
activity_month,
user_status,
row_number() over (partition by member_id, user_status order by calendar_month) as months
from user_status_monthly
order by 1,2;
Result (include activity_month for easy understanding):
member_id|calendar_month|activity_month|user_status|months|
---------+--------------+--------------+-----------+------+
1001| 201701| 201701|NEW | 1|
1001| 201702| |UNRETAINED | 1|
1001| 201703| |LAPSED | 1|
1001| 201704| |LAPSED | 2|
1001| 201705| 201705|REACTIVATED| 1|
1001| 201706| 201706|RETAINED | 1|
1001| 201707| |UNRETAINED | 2|
1001| 201708| |LAPSED | 3|
1001| 201709| 201709|REACTIVATED| 2|
1001| 201710| |UNRETAINED | 3|
1001| 201711| |LAPSED | 4|
1001| 201712| 201712|REACTIVATED| 3|
1002| 201703| 201703|NEW | 1|
1002| 201704| |UNRETAINED | 1|
1002| 201705| |LAPSED | 1|
1002| 201706| |LAPSED | 2|
1002| 201707| |LAPSED | 3|
1002| 201708| |LAPSED | 4|
1002| 201709| |LAPSED | 5|
1002| 201710| |LAPSED | 6|
1002| 201711| |LAPSED | 7|
1002| 201712| |LAPSED | 8|
EDIT:
The code was tested in MySQL because I didn't notice that the 'mysql' tag had been removed.
calendar_month in the code can be derived from the calendar dimension.
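The sample output above shows the months column continuing across separate LAPSED streaks (2, then 3 several months later), because row_number() is partitioned by member and status only. If the count should restart after each active month, one option is to use a running sum of an activity flag as a streak id. The sketch below is not the answer's query: it assumes a hypothetical member_month table with one row per member per month and an active flag (1 or 0), and it runs in SQLite (3.25 or later, for window functions) through Python's sqlite3 module.

```python
import sqlite3

# Hypothetical input: one row per member per month with an activity flag.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member_month (member_id INTEGER, calendar_month INTEGER, active INTEGER);
INSERT INTO member_month VALUES
    (1001, 201701, 1),
    (1001, 201702, 0),
    (1001, 201703, 0),
    (1001, 201704, 0),
    (1001, 201705, 1),
    (1001, 201706, 1),
    (1001, 201707, 0);
""")

# streak_id increases by one at every active month, so each active month
# and the inactive months that follow it share one streak_id.
# Row position inside the streak, minus one, is the number of months
# since the member was last active (0 on an active month).
streaks = conn.execute("""
WITH grouped AS (
    SELECT member_id,
           calendar_month,
           active,
           SUM(active) OVER (PARTITION BY member_id
                             ORDER BY calendar_month) AS streak_id
    FROM member_month
)
SELECT calendar_month,
       active,
       ROW_NUMBER() OVER (PARTITION BY member_id, streak_id
                          ORDER BY calendar_month) - 1 AS months_since_active
FROM grouped
ORDER BY member_id, calendar_month
""").fetchall()

for row in streaks:
    print(row)
```

Here the counter reads 1, 2, 3 for 201702 through 201704, drops to 0 on the active months, and starts again at 1 in 201707. Whether LAPSED_MONTHS should start counting at the UNRETAINED month or at the first LAPSED month is a business decision the question leaves open.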

How to return all records with the latest datetime value [PostgreSQL]

How can I return only the records with the latest upload_date(s) from the data below?
My data is as follows:
upload_date |day_name |rows_added|row_count_delta|days_since_last_update|
-----------------------+---------+----------+---------------+----------------------+
2022-05-01 00:00:00.000|Sunday | 526043| | |
2022-05-02 00:00:00.000|Monday | 467082| -58961| 1|
2022-05-02 15:58:54.094|Monday | 421427| -45655| 0|
2022-05-02 18:19:22.894|Monday | 421427| 0| 0|
2022-05-03 16:54:04.136|Tuesday | 496021| 74594| 1|
2022-05-03 18:17:27.502|Tuesday | 496021| 0| 0|
2022-05-04 18:19:26.392|Wednesday| 487154| -8867| 1|
2022-05-05 18:18:15.277|Thursday | 489713| 2559| 1|
2022-05-06 16:15:39.518|Friday | 489713| 0| 1|
2022-05-07 16:18:00.916|Saturday | 482955| -6758| 1|
My desired results should be:
upload_date |day_name |rows_added|row_count_delta|days_since_last_update|
-----------------------+---------+----------+---------------+----------------------+
2022-05-01 00:00:00.000|Sunday | 526043| | |
2022-05-02 18:19:22.894|Monday | 421427| 0| 0|
2022-05-03 18:17:27.502|Tuesday | 496021| 0| 0|
2022-05-04 18:19:26.392|Wednesday| 487154| -8867| 1|
2022-05-05 18:18:15.277|Thursday | 489713| 2559| 1|
2022-05-06 16:15:39.518|Friday | 489713| 0| 1|
2022-05-07 16:18:00.916|Saturday | 482955| -6758| 1|
NOTE only the latest upload_date for 2022-05-02 and 2022-05-03 should be in the result set.
You can use a window function to PARTITION by day (casting the timestamp to a date) and sort the results by most recent first by ordering by upload_date descending. Using ROW_NUMBER() it will assign a 1 to the most recent record per date. Then just filter on that row number. Note that I am assuming the datatype for upload_date is TIMESTAMP in this case.
SELECT
*
FROM (
SELECT
your_table.*,
ROW_NUMBER() OVER (PARTITION BY CAST(upload_date AS DATE)
ORDER BY upload_date DESC) rownum
FROM your_table
) AS ranked -- PostgreSQL versions before 16 require an alias on a subquery in FROM
WHERE rownum = 1
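To see the ROW_NUMBER() approach end to end, here is a small sketch on the first few rows of the sample data. It runs in SQLite through Python's sqlite3 module rather than PostgreSQL, so the uploads table name, the text timestamps, and date(upload_date) (in place of CAST(upload_date AS DATE)) are assumptions made for the sake of a self-contained example.

```python
import sqlite3

# A subset of the question's sample rows; timestamps are stored as ISO text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE uploads (upload_date TEXT, rows_added INTEGER);
INSERT INTO uploads VALUES
    ('2022-05-01 00:00:00.000', 526043),
    ('2022-05-02 00:00:00.000', 467082),
    ('2022-05-02 15:58:54.094', 421427),
    ('2022-05-02 18:19:22.894', 421427),
    ('2022-05-03 16:54:04.136', 496021),
    ('2022-05-03 18:17:27.502', 496021);
""")

# Number the rows within each calendar day, newest first,
# then keep only the row numbered 1 for each day.
latest = conn.execute("""
SELECT upload_date, rows_added
FROM (
    SELECT u.*,
           ROW_NUMBER() OVER (PARTITION BY date(upload_date)
                              ORDER BY upload_date DESC) AS rn
    FROM uploads u
) AS ranked
WHERE rn = 1
ORDER BY upload_date
""").fetchall()

for row in latest:
    print(row)
```

Only the 18:19 row survives for 2022-05-02 and only the 18:17 row for 2022-05-03, as in the desired result.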
WITH cte AS (
SELECT
max(upload_date) OVER (PARTITION BY upload_date::date),
upload_date,
day_name,
rows_added,
row_count_delta,
days_since_last_update
FROM test101 ORDER BY 1
)
SELECT
upload_date,
day_name,
rows_added,
row_count_delta,
days_since_last_update
FROM
cte
WHERE
max = upload_date;
This is more verbose but I find it easier to read and build:
SELECT *
FROM mytable t1
JOIN (
SELECT CAST(upload_date AS DATE) day_date, MAX(upload_date) max_date
FROM mytable
GROUP BY day_date) t2
ON t1.upload_date = t2.max_date AND
CAST(upload_date AS DATE) = t2.day_date;
I don't know about performance offhand, but I suspect the window function is slower because it needs to sort, which is usually a slow operation unless your table already has an index that supports it.
Use DISTINCT ON:
SELECT DISTINCT ON (date_trunc('day', upload_date))
to_char(upload_date, 'Day') AS weekday, * -- added weekday optional
FROM tbl
ORDER BY date_trunc('day', upload_date), upload_date DESC;
For few rows per day (like your sample data suggests) it's the simplest and fastest solution possible. See:
Select first row in each GROUP BY group?
I dropped the redundant column day_name from the table. That's just a redundant representation of the timestamp. Storing it only adds cost and noise and opportunities for inconsistent data. If you need the weekday displayed, use to_char(upload_date, 'Day') AS weekday like demonstrated above.
The query works for any number of days, not restricted to 7 weekdays.

How can I summarize data by year in SQL?

I'm sure the request is rather straight-forward, but I'm stuck. I'd like to take the first table below and turn it into the second table by summing up Incremental_Inventory by Year.
+-------------+-----------+----------------------+-----+
|Warehouse_ID |Date |Incremental_Inventory |Year |
+-------------+-----------+----------------------+-----+
| 1|03/01/2010 |125 |2010 |
| 1|08/01/2010 |025 |2010 |
| 1|02/01/2011 |150 |2011 |
| 1|03/01/2011 |200 |2011 |
| 2|03/01/2012 |125 |2012 |
| 2|03/01/2012 |025 |2012 |
+-------------+-----------+----------------------+-----+
to
+-------------+-----------+---------------------------+
|Warehouse_ID |Date |Cumulative_Yearly_Inventory|
+-------------+-----------+---------------------------+
| 1|03/01/2010 |125 |
| 1|08/01/2010 |150 |
| 1|02/01/2011 |150 |
| 1|03/01/2011 |350 |
| 2|03/01/2012 |125 |
| 2|03/01/2012 |150 |
+-------------+-----------+---------------------------+
If your DBMS, which you haven't told us, supports window functions you could simply do something like:
SELECT warehouse_id,
date,
sum(incremental_inventory) OVER (PARTITION BY warehouse_id,
year(date)
ORDER BY date) cumulative_yearly_inventory
FROM elbat
ORDER BY date;
year() may need to be replaced by whatever your DBMS provides to extract the year from a date.
If it doesn't support window functions, you would have to use a subquery and aggregation.
SELECT t1.warehouse_id,
t1.date,
(SELECT sum(t2.incremental_inventory)
FROM elbat t2
WHERE t2.warehouse_id = t1.warehouse_id
AND year(t2.date) = year(t1.date)
AND t2.date <= t1.date) cumulative_yearly_inventory
FROM elbat t1
ORDER BY t1.date;
However, if there are two equal dates, this will print the same sum for both of them. One would need another, distinct column to sort that out and as far as I can see you don't have such a column in the table.
I'm not sure if you want the sum over all warehouses or only per warehouse. If you don't want the sums split by warehouses but one sum for all warehouses together, remove the respective expressions from the PARTITION BY or inner WHERE clause.
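One caveat about tied dates: with only ORDER BY inside OVER, the default window frame treats rows with equal dates as peers, so the window-function version also returns the same sum for both of them. A possible workaround, sketched below under assumptions, is an explicit ROWS frame plus a tie-breaker. The sketch runs in SQLite through Python's sqlite3 module; the inv_date column name, the ISO date strings, strftime('%Y', ...) in place of year(), and SQLite's implicit rowid as the tie-breaker are all choices made for this illustration, and your DBMS may need a real unique column instead.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO text (assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE elbat (warehouse_id INTEGER, inv_date TEXT, incremental_inventory INTEGER);
INSERT INTO elbat VALUES
    (1, '2010-03-01', 125),
    (1, '2010-08-01', 25),
    (1, '2011-02-01', 150),
    (1, '2011-03-01', 200),
    (2, '2012-03-01', 125),
    (2, '2012-03-01', 25);
""")

# ROWS (not the default RANGE) makes the frame end at the current row,
# and rowid breaks ties between rows that share the same date.
cumulative = conn.execute("""
SELECT warehouse_id,
       inv_date,
       SUM(incremental_inventory) OVER (
           PARTITION BY warehouse_id, strftime('%Y', inv_date)
           ORDER BY inv_date, rowid
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS cumulative_yearly_inventory
FROM elbat
ORDER BY warehouse_id, inv_date, rowid
""").fetchall()

for row in cumulative:
    print(row)
```

With the tie-breaker, the two rows for warehouse 2 on the same date get 125 and 150 instead of 150 twice.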
If you have SAS/ETS then the time series tasks will do this for you. Assuming not, here's a data step solution.
Use RETAIN to hold value across rows
Use BY to identify the first record for each year
data want;
set have;
by year;
retain cum_total;
if first.year then cum_total=incremental_inventory;
else cum_total+incremental_inventory;
run;

Hive- Error : missing EOF at 'WHERE'

I'm trying to learn Hive, especially functions like unix_timestamp and from_unixtime.
I have three tables
emp (employee table)
+---+----------------+
| id| name|
+---+----------------+
| 1| James Gordon|
| 2| Harvey Bullock|
| 3| Kristen Kringle|
+---+----------------+
txn (transaction table)
+------+----------+---------+
|acc_id|trans_date|trans_amt|
+------+----------+---------+
| 101| 20180105| 951|
| 102| 20180205| 800|
| 103| 20180131| 100|
| 101| 20180112| 50|
| 102| 20180126| 800|
| 103| 20180203| 500|
+------+----------+---------+
acc (account table)
+---+------+--------+
| id|acc_id|cred_lim|
+---+------+--------+
| 1| 101| 1000|
| 2| 102| 1500|
| 3| 103| 800|
+---+------+--------+
I want to find out the people whose trans_amt exceeded their cred_lim in the month of Jan 2018.
The query I'm trying to use is
WITH tabl as
(
SELECT e.id, e.name, a.acc_id, t.trans_amt, a.cred_lim, from_unixtime(unix_timestamp(t.trans_date, 'yyyyMMdd'), 'MMM yyyy') month
FROM emp e JOIN acc a on e.id = a.id JOIN txn t on a.acc_id = t.acc_id
)
SELECT acc_id, sum(trans_amt) total_amt
FROM tabl
GROUP BY tabl.acc_id, tabl.month
WHERE tabl.month = 'Jan 2018' AND tabl.total_amt > cred_lim;
But when I run it, I get an error saying
FAILED: ParseException line 9:2 missing EOF at 'WHERE' near 'month'
This error persists even when I change the where clause to
WHERE tabl.total_amt > cred_lim;
This makes me think the error comes from the GROUP BY clause but I can't seem to figure this out.
Could someone help me with this?
Your query has several problems.
The WHERE clause must come before GROUP BY.
There is an extra ')' after the GROUP BY columns.
tabl.total_amt > cred_lim cannot be used in the WHERE clause, because WHERE is evaluated before aggregation, so the alias total_amt does not exist yet at that point. Use a HAVING clause instead.
I've made these changes in the query below, and it should work for you.
WITH tabl
AS (
SELECT e.id
,e.name
,a.acc_id
,t.trans_amt
,a.cred_lim
,from_unixtime(unix_timestamp(t.trans_date, 'yyyyMMdd'), 'MMM yyyy') month
FROM emp e
INNER JOIN acc a ON e.id = a.id
INNER JOIN txn t ON a.acc_id = t.acc_id
)
SELECT acc_id
,sum(trans_amt) total_amt
FROM tabl
WHERE month = 'Jan 2018'
GROUP BY acc_id
,month
HAVING SUM(trans_amt) > MAX(cred_lim);
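The clause order (WHERE, then GROUP BY, then HAVING) can be tried out on the sample tables without a Hive cluster. The sketch below uses SQLite through Python's sqlite3 module, so it is only an approximation of the Hive query: from_unixtime and unix_timestamp are not available there, and a numeric range on trans_date is swapped in for the 'Jan 2018' filter as an assumption of this example.

```python
import sqlite3

# The acc and txn sample tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE acc (id INTEGER, acc_id INTEGER, cred_lim INTEGER);
CREATE TABLE txn (acc_id INTEGER, trans_date INTEGER, trans_amt INTEGER);
INSERT INTO acc VALUES (1, 101, 1000), (2, 102, 1500), (3, 103, 800);
INSERT INTO txn VALUES
    (101, 20180105, 951),
    (102, 20180205, 800),
    (103, 20180131, 100),
    (101, 20180112, 50),
    (102, 20180126, 800),
    (103, 20180203, 500);
""")

# WHERE filters individual rows (January 2018 only) before grouping;
# HAVING filters the groups after SUM() has been computed.
over_limit = conn.execute("""
SELECT a.acc_id,
       SUM(t.trans_amt) AS total_amt
FROM acc a
JOIN txn t ON a.acc_id = t.acc_id
WHERE t.trans_date BETWEEN 20180101 AND 20180131
GROUP BY a.acc_id
HAVING SUM(t.trans_amt) > MAX(a.cred_lim)
""").fetchall()

print(over_limit)
```

Only account 101 qualifies: its January transactions add up to 1001 against a credit limit of 1000.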

Counting Combinations of rows/columns across one table

I have a query I am having trouble wrapping my head around. What I'm trying to do is come up with a report that will have rows of the major(Accounting, Business, etc) with columns of the type of enrollment(enrolled, withdrawn, etc) with counts for each. Right now, here is my query.
SELECT datatel_academicprogramofinterestidname, datatel_prospectstatusname
FROM FilteredContact
GROUP BY datatel_academicprogramofinterestidname, datatel_prospectstatusname
Which gives me every combination of these two fields found in my table. I want to get counts for each of the combinations to display. The rows would be the interestidname field, and the columns would be the prospectstatusname field. In every cell there would be a count for how many of that specific combination (Accounting/Enrolled, for example)
I've tried count in multiple ways but can't seem to get it to break the columns out the way I want. Not sure how to use group by, count, and my where clause all in conjunction to have it formatted how I want. The good thing is all my information is in one table, I just can't make it look how I want.
|Accounting (BS) | Accepted | 25|
|Acting (BFA) | Accepted | 32|
|Advertising & Marketing Communications (BA) | Accepted | 29|
|American Studies (BA) | Accepted | 2|
|Accounting (BS) | Enrolled | 5|
|Acting (BFA) | Enrolled | 17|
|Advertising & Marketing Communications (BA) | Enrolled | 40|
|American Studies (BA) | Enrolled | 10|
You may be able to use a pivot here. I created a SQL Fiddle to go along with this. Assuming your primary key field name is p_id
SELECT *
FROM FilteredContact
PIVOT ( count(p_id)
for datatel_prospectstatusname IN ([Accepted], [Enrolled])
) as p
You can easily turn the rows of data into columns using an aggregate function with a CASE expression:
select datatel_academicprogramofinterestidname,
sum(case when datatel_prospectstatusname = 'Enrolled' then 1 else 0 end) Enrolled,
sum(case when datatel_prospectstatusname = 'Accepted' then 1 else 0 end) Accepted
from FilteredContact
group by datatel_academicprogramofinterestidname
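Here is a small runnable sketch of the CASE-based pivot, using SQLite through Python's sqlite3 module. The five rows are made-up illustration data, not the asker's real counts; only the table and column names come from the question.

```python
import sqlite3

# Made-up rows for illustration; one row per contact.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FilteredContact (
    datatel_academicprogramofinterestidname TEXT,
    datatel_prospectstatusname TEXT
);
INSERT INTO FilteredContact VALUES
    ('Accounting (BS)', 'Accepted'),
    ('Accounting (BS)', 'Accepted'),
    ('Accounting (BS)', 'Enrolled'),
    ('Acting (BFA)',    'Enrolled'),
    ('Acting (BFA)',    'Enrolled');
""")

# One output row per program, one counted column per status.
# Each CASE yields 1 for a matching row and 0 otherwise, so SUM counts matches.
pivoted = conn.execute("""
SELECT datatel_academicprogramofinterestidname AS program,
       SUM(CASE WHEN datatel_prospectstatusname = 'Accepted' THEN 1 ELSE 0 END) AS accepted,
       SUM(CASE WHEN datatel_prospectstatusname = 'Enrolled' THEN 1 ELSE 0 END) AS enrolled
FROM FilteredContact
GROUP BY datatel_academicprogramofinterestidname
ORDER BY program
""").fetchall()

for row in pivoted:
    print(row)
```

Each additional status needs its own SUM(CASE ...) column, so this form works best when the list of statuses is small and known in advance.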