Cumulative sum with proc sql

I want to create a table "table_min_date_100d_per_country" that contains, for each country, the first date on which the cumulative number of COVID cases (summed by date) reaches 100.
I have the columns date, cas_covid, country.
Sample data:
Date        Cas_covid  country
2019-12-31  10         France
2020-01-01  15         France
2020-01-02  45         France
2020-01-03  5          France
2020-01-04  15         France
2020-01-05  11         France
The output is
2020-01-05 COVID cases = 101 country = France
Thanks.

If you are using SAS, it is much easier to get the cumulative sum with a data step; there is no direct way to do it with proc sql. Assuming your data is called "old_data" and is already sorted by country and date, the following code creates a new dataset with a cumulative sum variable ("cum_sum") per country:
data temp_data;
set old_data;
by country;
if first.country then cum_sum=0;
cum_sum+Cas_covid;
run;
After calculating the cumulative sum by country, you can get your desired output with proc sql, if you prefer, by keeping only rows where cum_sum is at least 100 and then retaining the minimum qualifying value per country:
proc sql;
create table table_min_date_100d_per_country as
select distinct
date,
cum_sum as COVID_cases,
country
from temp_data
where cum_sum >= 100 /*Only evaluate rows where the running total has reached 100*/
group by country /*This line gets you summarizing statistics by country*/
having COVID_cases = min(COVID_cases) /*Within each country, keep only the minimum qualifying total, i.e. the first date the threshold was crossed*/;
quit;
If your data is not sorted, you should first run
proc sort data=old_data;
by country date;
run;
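Outside SAS, the same two steps (a running total per group, then the first row at or above the threshold) can be sketched in plain Python. This is only an illustration of the logic with the sample data above, not SAS code:

```python
from itertools import groupby

# Sample rows: (date, cas_covid, country), already sorted by country then date
rows = [
    ("2019-12-31", 10, "France"),
    ("2020-01-01", 15, "France"),
    ("2020-01-02", 45, "France"),
    ("2020-01-03", 5, "France"),
    ("2020-01-04", 15, "France"),
    ("2020-01-05", 11, "France"),
]

def first_date_over(rows, threshold=100):
    """Per country, return the first date the running total reaches the threshold."""
    result = {}
    for country, group in groupby(rows, key=lambda r: r[2]):
        cum_sum = 0  # plays the role of "if first.country then cum_sum=0"
        for date, cases, _ in group:
            cum_sum += cases  # plays the role of "cum_sum+Cas_covid"
            if cum_sum >= threshold:
                result[country] = (date, cum_sum)
                break
    return result

print(first_date_over(rows))  # {'France': ('2020-01-05', 101)}
```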
Best regards,

Related

Presenting summary information with splitting the value of a column

Is there a way to produce the following from the table below:
customer_id | loan_date | loan_amount | loan_paid | status
------------+------------+-------------+-----------+--------
customer1 04/02/2010 5000 3850 active
customer2 04/02/2010 3000 3000 completed
customer3 04/02/2010 6500 4300 defaulted
...
The average loan, the standard deviation of all the loans, the number of loans, the total amount defaulted, and the total amount collected per month. (I have data for about 5 years.)
I have no idea of where to start.
Start like this:
SELECT date_trunc('month', loan_date)
, avg(loan_amount) AS avg_loan
, stddev_samp(loan_amount) AS stddev_samp
, count(*) AS ct_loans
, count(*) FILTER (WHERE status = 'defaulted') AS ct_defaulted
, sum(loan_paid) AS sum_paid
FROM tbl
GROUP BY 1
ORDER BY 1;
Then refine. Details are unclear. Not sure what loan_paid signifies exactly, and what you want to sum exactly. And there are multiple measures under the name of "standard deviation" ...
About aggregate functions.
About date_trunc().
About GROUP BY 1:
Concatenate multiple result rows of one column into one, group by another column
About the aggregate FILTER clause:
Aggregate columns with additional (distinct) filters
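The query above is PostgreSQL; if you want to experiment without a Postgres instance, a similar query can be sketched with Python's built-in sqlite3. SQLite has no date_trunc() or stddev_samp(), so strftime() and a CASE expression stand in here, and the standard deviation is omitted; the table and column names follow the answer above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tbl (
    customer_id TEXT, loan_date TEXT, loan_amount REAL,
    loan_paid REAL, status TEXT)""")
con.executemany("INSERT INTO tbl VALUES (?,?,?,?,?)", [
    ("customer1", "2010-02-04", 5000, 3850, "active"),
    ("customer2", "2010-02-04", 3000, 3000, "completed"),
    ("customer3", "2010-02-04", 6500, 4300, "defaulted"),
    ("customer4", "2010-03-10", 2000, 2000, "completed"),
])

rows = con.execute("""
    SELECT strftime('%Y-%m', loan_date)          AS month,        -- date_trunc('month', ...)
           avg(loan_amount)                      AS avg_loan,
           count(*)                              AS ct_loans,
           sum(CASE WHEN status = 'defaulted'
                    THEN 1 ELSE 0 END)           AS ct_defaulted,  -- stands in for FILTER (WHERE ...)
           sum(loan_paid)                        AS sum_paid
    FROM tbl
    GROUP BY 1
    ORDER BY 1;
""").fetchall()
print(rows)
```

With the sample rows this yields one result row per month, e.g. February 2010 with 3 loans, 1 default, and 11150 collected.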

SQL - Monthly cumulative count of new customer based on a created date field

Thanks in advance.
I have Customer records that look like this:
Customer_Number  Create_Date
34343            01/22/2001
54554            03/03/2020
85296            01/01/2001
...
I have about a thousand of these records (customer number is unique) and the bossman wants to see how the number of customers has grown over time.
The output I need:
Customer_Count  Monthly_Bucket
7               01/01/2021
9               02/01/2021
13              03/01/2021
20              04/01/2021
The customer count is cumulative and the Monthly Bucket will just feed the graphing package to make a nice bar chart answering the question "how many customers do we have in total in a particular month and how is it growing over time".
Try the following SELECT SQL with a sub-query:
SELECT Customer_Count=
(
SELECT COUNT(s.[Create_Date])
FROM [Customer_Sales] s
WHERE YEAR(s.[Create_Date]) = ????
AND MONTH(s.[Create_Date]) <= MONTH(t.[Create_Date])
), Monthly_Bucket=MONTH(t.[Create_Date])
FROM Customer_Sales t
WHERE YEAR(t.[Create_Date]) = ????
GROUP BY MONTH(t.[Create_Date])
Where [Customer_Sales] is the sales table and ???? is your year. Note that the subquery must restrict the year as well; otherwise it counts matching months from every year.
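Independent of any SQL dialect, the cumulative logic itself is easy to check in plain Python: count new customers per month bucket, then keep a running total. A minimal sketch with made-up dates:

```python
from collections import Counter
from itertools import accumulate

# (customer_number, create_date) pairs; ISO dates sort correctly as strings
customers = [
    (34343, "2021-01-22"),
    (85296, "2021-01-01"),
    (54554, "2021-02-03"),
    (11111, "2021-02-14"),
    (22222, "2021-03-05"),
]

new_per_month = Counter(date[:7] for _, date in customers)  # "YYYY-MM" buckets
months = sorted(new_per_month)
running = accumulate(new_per_month[m] for m in months)      # cumulative count
cumulative = dict(zip(months, running))
print(cumulative)  # {'2021-01': 2, '2021-02': 4, '2021-03': 5}
```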

How to change the base year of constant dollars

I have a table that contains the monthly values ($) of building permits, per region, per type of structure. I have them in current dollars and constant 2012 dollars. I would like to change the constant dollars to a base of the most recent month, ie 2021-05.
The Worldbank says this about changing the base year of constant dollars:
For example, you can rescale the 2010 data to 2005 by first creating an index dividing each year of the constant 2010 series by its 2005 value (thus, 2005 will equal 1). Then multiply each year's index result by the corresponding 2005 current U.S. dollar price value.
My table looks something like this (in reality, there are many cities, each having many types of structures, eg: Residential, institutional, etc.):
Period City Type of structure Value valueAdjustment
2011-01-01 New York Commercial, total 125478 Current Dollars
2011-01-01 New York Commercial, total 129276 Constant dollars
2011-02-01 New York Commercial, total 120568 Current Dollars
2011-02-01 New York Commercial, total 124110 Constant dollars
...
2021-04-01 New York Commercial, total 197296 Current Dollars
2021-04-01 New York Commercial, total 154500 Constant dollars
2021-05-01 New York Commercial, total 155043 Current Dollars
2021-05-01 New York Commercial, total 121082 Constant dollars
What I've thought of doing is to create a column, Rank, to then use some variation of ROW_NUMBER to easily compare every month to 2021-05. I populated the rank like so:
WITH cteRank AS(
SELECT t.*,
Rnk = DENSE_RANK() OVER (ORDER BY YEAR([Period]), DATEPART(MONTH,[Period]))
- COUNT(CASE WHEN YEAR([Period]) = 2021 AND DATEPART(MONTH,[Period])=2 THEN 1 END) OVER ()
- 1 +836
FROM [buildingPermits] t)
UPDATE cteRank SET [Rank] = Rnk FROM cteRank ;
The +836 is because with the way I coded it, it counts every instance of 2021-05, so I added the count to cancel it out. Not very efficient, but it works.
The resulting Rank column looks like this:
Period Rank City Type of structure Value valueAdjustment
2011-01-01 -124 New York Commercial, total 125478 Current Dollars
2011-01-01 -124 New York Commercial, total 129276 Constant dollars
2011-02-01 -123 New York Commercial, total 120568 Current Dollars
2011-02-01 -123 New York Commercial, total 124110 Constant dollars
...
2021-04-01 -1 New York Commercial, total 197296 Current Dollars
2021-04-01 -1 New York Commercial, total 154500 Constant dollars
2021-05-01 0 New York Commercial, total 155043 Current Dollars
2021-05-01 0 New York Commercial, total 121082 Constant dollars
The last step is the Worldbank's formula adapted to my need:
For example, you can rescale the 2012 data to 2021-05 by first creating an index dividing each year of the constant 2012 series by its 2021-05 value (thus, 2021-05 will equal 1). Then multiply each year's index result by the corresponding 2021-05 current U.S. dollar price value.
So for 2011-01, it would be:
(Constant 2011-01) / (Constant 2021-05) * (Current 2021-05)
129276 / 121082 * 155043
=165535
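That arithmetic is easy to sanity-check in a couple of lines:

```python
constant_201101 = 129276  # constant-2012 value for 2011-01
constant_202105 = 121082  # constant-2012 value for 2021-05 (the new base period)
current_202105 = 155043   # current-dollar value for 2021-05

# Worldbank formula: index against the new base, then scale by its current value
rebased = constant_201101 / constant_202105 * current_202105
print(round(rebased))  # 165535
```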
Here's some pseudo-code for the division using a subquery, but it obviously returns an error because I didn't target a specific city and type of structure. Partitioning returned an error as well: "Incorrect syntax near the keyword 'over'".
SELECT Period, [Type of structure],City,
Value/ (SELECT Value
FROM [buildingPermits]
WHERE YEAR(Period) = 2021
and DATEPART(MONTH,Period) = 5
and valueAdjustment= 'Constant dollars' and [Rank] = 0)
FROM [buildingPermits]
WHERE valueAdjustment= 'Constant dollars'
Error: Subquery returned more than 1 value.
This is not permitted when the subquery follows =, !=, <, <= , >, >=
or when the subquery is used as an expression.
I think a SELF JOIN, a temp table, subquery or somehow using MAX(Rank) to get the value at rank 0 (2021-05) could do the trick, but I don't know how to go about implementing either of those solutions.
Any help is appreciated
Does this get you closer to what you want?
SELECT Const201101.[Period], Const201101.City, Const201101.[Type of structure],
Const201101.[Value] as [Const 201101 Value],
Const202105.[Value] as [Const 202105 Value],
Curr202105.[Value] as [Curr 202105 Value]
FROM [buildingPermits] Const201101
JOIN [buildingPermits] Const202105 ON
Const202105.City = Const201101.City AND
Const202105.[Type of structure] = Const201101.[Type of structure] AND
Const202105.[Period] = '2021-05-01' AND
Const202105.valueAdjustment = 'Constant dollars'
JOIN [buildingPermits] Curr202105 ON
Curr202105.City = Const201101.City AND
Curr202105.[Type of structure] = Const201101.[Type of structure] AND
Curr202105.[Period] = '2021-05-01' AND
Curr202105.valueAdjustment = 'Current Dollars'
WHERE Const201101.valueAdjustment= 'Constant dollars' AND Const201101.[Period] = '2011-01-01'
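The two self-joins plus the final rebasing step can also be sketched in Python over the sample rows. This is just an illustration of the lookup-and-rescale logic, with the column layout from the table above:

```python
# Rows: (period, city, structure, value, adjustment)
permits = [
    ("2011-01-01", "New York", "Commercial, total", 125478, "Current Dollars"),
    ("2011-01-01", "New York", "Commercial, total", 129276, "Constant dollars"),
    ("2021-05-01", "New York", "Commercial, total", 155043, "Current Dollars"),
    ("2021-05-01", "New York", "Commercial, total", 121082, "Constant dollars"),
]

# Index the base-period (2021-05) values by (city, structure),
# mirroring the Const202105 and Curr202105 self-joins
base_const = {(c, s): v for p, c, s, v, a in permits
              if p == "2021-05-01" and a == "Constant dollars"}
base_curr = {(c, s): v for p, c, s, v, a in permits
             if p == "2021-05-01" and a == "Current Dollars"}

# Rebase every constant-dollar row: value / base_constant * base_current
rebased = {
    (p, c, s): round(v / base_const[(c, s)] * base_curr[(c, s)])
    for p, c, s, v, a in permits if a == "Constant dollars"
}
print(rebased[("2011-01-01", "New York", "Commercial, total")])  # 165535
```

Note that the base period itself rebases to its own current-dollar value (155043), which is a quick consistency check.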

Conditional Sum SQL

I am very new to SQL and have been presented with what seems to me a complex task. I have a table recording the number of various fruits purchased on a given day. Thus:
Date         Fruit    G.A   G.B
2016-06-01   Banana   45    0
2016-06-01   Pear     158   0
2016-06-01   apple    0     23
.... dates continue
I need to develop some kind of conditional sum to count how many pieces of fruit are bought with a specific grade on a specific date. So in the above case, on the given date (2016-06-01) there would be 203 Grade A (G.A) pieces of fruit and 23 Grade B (G.B) pieces of fruit.
Naturally some kind of
Sum(case when date=date then Grade else 0 ).
But, I am really baffled here. Please, any help would be greatly appreciated!!!!
A simple GROUP BY clause should do the job here (note: untested code):
select date, sum(grade_a) as grade_a_sum, sum(grade_b) as grade_b_sum
from sales
group by date;
This will give the grades for every date. Individual dates can then be selected if necessary.
Won't a simple GROUP BY do the job?
Select
date,
sum(GA) as GA,
sum(GB) as GB
from
Table
group by date
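Either answer's query is easy to try out with Python's sqlite3 module — a quick sketch using the first answer's assumed table name `sales` and the sample rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (date TEXT, fruit TEXT, grade_a INT, grade_b INT)")
con.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    ("2016-06-01", "Banana", 45, 0),
    ("2016-06-01", "Pear", 158, 0),
    ("2016-06-01", "apple", 0, 23),
])

# Sum each grade column per date
rows = con.execute("""
    SELECT date, sum(grade_a) AS grade_a_sum, sum(grade_b) AS grade_b_sum
    FROM sales
    GROUP BY date
""").fetchall()
print(rows)  # [('2016-06-01', 203, 23)]
```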

Get latest cumulative sales amount for various evaluation dates in SAS

I have a list of evaluation dates stored in a table, datelist. It's technically two columns, start_date and end_date, for each evaluation period. The end_date will definitely need to be used, but the start_date may not. I only care about periods that are completed, so, for example, the period from 2016-01-01 to 2016-07-01 is in progress but not complete. So, it's not in the table.
start_date end_date
2012-01-01 2012-07-01
2012-07-01 2013-01-01
2013-01-01 2013-07-01
2013-07-01 2014-01-01
2014-01-01 2014-07-01
2014-07-01 2015-01-01
2015-01-01 2015-07-01
2015-07-01 2016-01-01
I have a separate table that lists cumulative sales by customer, sales_table with three columns, customer_ID, cumul_sales, transaction_date. For example, let's say customer 4793 bought $100 worth of stuff on 2/14/2014 and $200 worth of stuff on 3/30/2014 and $75 on 7/27/2014, the table will have the following rows:
customer_ID cumul_sales transaction_date
4793 100 2014-02-14
4793 300 2014-03-30
4793 375 2014-07-27
Now, for each evaluation date and for each customer, I want to know what's the cumulative sales as of the evaluation date for that customer? If a customer hadn't purchased anything by an evaluation date, then I wouldn't want a row for that customer at all corresponding to said evaluation date. This would be stored in a new table, called sales_by_eval, with columns customer_ID, cumul_sales, eval_date. For the example customer above, I'd have the following rows:
customer_ID cumul_sales eval_date
4793 300 2014-07-01
4793 375 2015-01-01
4793 375 2015-07-01
4793 375 2016-01-01
I can do this, but I'm looking to do it in an efficient way so I don't have to read through the data once for each evaluation date. If there are a lot of rows in the sales_table and 40 evaluation dates, that would be a large waste to read through the data 40 times, once for each evaluation date. Would it be possible with only one read through the data, for example?
The basic idea of the current process is a macro loop that loops once per evaluation period. Each loop has a data step that creates a new table (one table per loop) to check each transaction to see if it has occurred before or on the end_date of that corresponding evaluation period. That is, each table has all the transactions that occur before or on that evaluation date but none of the ones that occur after. Then, a later data step uses "last." to get only the last transaction for each customer before that evaluation date. And, finally, all the various tables created are put back together in another data step where they are all listed in the SET statement.
This is in SAS, so anything SAS can do, including SQL and macros, is fine with me.
In SAS proc sql, when you use a GROUP BY clause you can still reference non-grouped variables in the SELECT statement, like this:
proc sql;
create table sales_by_eval as
select s.customer_ID, s.cumul_sales, d.end_date as eval_date
from datelist d
join sales_table s
on d.end_date > s.transaction_date
group by s.customer_ID, d.end_date
having max(s.transaction_date) = s.transaction_date
;
quit;
This means that for each combination of the selected variables, SAS returns a record with the measures summarized within the defined group. To limit the result to the latest state of the transaction value, use the HAVING condition, which selects only the records whose transaction_date equals max(transaction_date) within each s.customer_ID, d.end_date group.
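The as-of logic behind that query — for each end_date, take each customer's latest transaction dated before it — can also be sketched in Python with a single pass to group the data plus a binary search per evaluation date. A minimal illustration with the sample customer, matching the query's strictly-before (`>`) condition:

```python
import bisect

eval_dates = ["2014-07-01", "2015-01-01", "2015-07-01", "2016-01-01"]

# (customer_id, cumul_sales, transaction_date), sorted by customer then date
sales = [
    (4793, 100, "2014-02-14"),
    (4793, 300, "2014-03-30"),
    (4793, 375, "2014-07-27"),
]

# One pass to group transaction history per customer
by_customer = {}
for cust, cumul, tdate in sales:
    by_customer.setdefault(cust, []).append((tdate, cumul))

# For each eval date, binary-search the last transaction strictly before it;
# customers with no prior transaction get no row, as the question requires
sales_by_eval = []
for cust, hist in by_customer.items():
    dates = [d for d, _ in hist]
    for end_date in eval_dates:
        i = bisect.bisect_left(dates, end_date)  # count of dates < end_date
        if i > 0:
            sales_by_eval.append((cust, hist[i - 1][1], end_date))

print(sales_by_eval)
```

For customer 4793 this reproduces the expected sales_by_eval rows: 300 as of 2014-07-01, then 375 for the three later evaluation dates.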