Get the number of unique days with overlapping dates (in SAS) - sql

I couldn't briefly explain the problem so I'll try to explain it this way. Let's say I have a table similar to the one below.
How do I get the total number of days in October per student that that student has at least 1 book checked out?
Please note that a single student can check out more than 1 book at a time which cause the overlapping dates.
Student
Book
Date_Borrowed
Date_Returned
David
A Thousand Splendid Suns
01 Oct 2021
05 Oct 2021
David
Jane Eyre
09 Oct 2021
13 Oct 2021
David
Please Look After Mom
21 Oct 2021
29 Oct 2021
Fiona
Sense and Sensibility
05 Oct 2021
14 Oct 2021
Fiona
The Girl Who Saved the King of Sweden
05 Oct 2021
14 Oct 2021
Fiona
A Fort of Nine Towers
02 Oct 2021
17 Oct 2021
Fiona
One Hundred Years of Solitude
20 Oct 2021
30 Oct 2021
Fiona
The Unbearable Lightness of Being
20 Oct 2021
30 Oct 2021
Greg
Fahrenheit 451
06 Oct 2021
11 Oct 2021
Greg
One Hundred Years of Solitude
10 Oct 2021
17 Oct 2021
Greg
Please Look After Mom
15 Oct 2021
21 Oct 2021
Greg
4 3 2 1
20 Oct 2021
27 Oct 2021
Greg
The Girl Who Saved the King of Sweden
27 Oct 2021
03 Nov 2021
Marcus
Fahrenheit 451
01 Oct 2021
04 Oct 2021
Marcus
Nectar in a Sieve
15 Oct 2021
15 Oct 2021
Marcus
Please Look After Mom
30 Oct 2021
31 Oct 2021
Priya
Like Water for Chocolate
02 Oct 2021
21 Oct 2021
Priya
Fahrenheit 451
21 Oct 2021
22 Oct 2021
Sasha
Baudolino
03 Oct 2021
29 Oct 2021
Sasha
A Thousand Splendid Suns
07 Oct 2021
16 Oct 2021
Sasha
A Fort of Nine Towers
26 Oct 2021
01 Nov 2021
Thanks in advance!

Using the data step, you can expand each date into a long format. From there, you can use SQL to do a simple count by student after removing overlapping dates.
data foo;
set have;
do date = date_borrowed to date_returned;
output;
end;
keep student date;
format date date9.;
run;
This gets us a long table of all the dates with at least one book checked out for each student.
student date
David 01OCT2021
David 02OCT2021
David 03OCT2021
David 04OCT2021
David 05OCT2021
David 09OCT2021
...
Now we need to remove the overlapping dates.
proc sort data=foo nodupkey;
by student date;
run;
From here, we can do a simple SQL count per student.
proc sql noprint;
create table want as
select student
, intnx('month', date, 0, 'B') as month format=monyy7.
, count(*) as days_checked_out
from foo
where calculated month = '01OCT2021'd
group by student, calculated month
;
quit;
Output:
student month days_checked_out
David OCT2021 19
Fiona OCT2021 27
Greg OCT2021 26
Marcus OCT2021 7
Priya OCT2021 21
Sasha OCT2021 29

An easy way is to make a temporary array with one variable for each day in the time period you want to count. Then just use a do loop to set the variables representing those days to 1. When you have reached the last record for a student then take the sum to find the number of days covered.
First let's convert your posted table into a dataset.
data have;
infile cards dsd dlm='|' truncover;
input Student :$20. Book :$100. (Date_Borrowed Date_Returned) (:date.);
format Date_Borrowed Date_Returned date11.;
cards;
David|A Thousand Splendid Suns|01 Oct 2021|05 Oct 2021
David|Jane Eyre|09 Oct 2021|13 Oct 2021
David|Please Look After Mom|21 Oct 2021|29 Oct 2021
Fiona|Sense and Sensibility|05 Oct 2021|14 Oct 2021
Fiona|The Girl Who Saved the King of Sweden|05 Oct 2021|14 Oct 2021
Fiona|A Fort of Nine Towers|02 Oct 2021|17 Oct 2021
Fiona|One Hundred Years of Solitude|20 Oct 2021|30 Oct 2021
Fiona|The Unbearable Lightness of Being|20 Oct 2021|30 Oct 2021
Greg|Fahrenheit 451|06 Oct 2021|11 Oct 2021
Greg|One Hundred Years of Solitude|10 Oct 2021|17 Oct 2021
Greg|Please Look After Mom|15 Oct 2021|21 Oct 2021
Greg|4 3 2 1|20 Oct 2021|27 Oct 2021
Greg|The Girl Who Saved the King of Sweden|27 Oct 2021|03 Nov 2021
Marcus|Fahrenheit 451|01 Oct 2021|04 Oct 2021
Marcus|Nectar in a Sieve|15 Oct 2021|15 Oct 2021
Marcus|Please Look After Mom|30 Oct 2021|31 Oct 2021
Priya|Like Water for Chocolate|02 Oct 2021|21 Oct 2021
Priya|Fahrenheit 451|21 Oct 2021|22 Oct 2021
Sasha|Baudolino|03 Oct 2021|29 Oct 2021
Sasha|A Thousand Splendid Suns|07 Oct 2021|16 Oct 2021
Sasha|A Fort of Nine Towers|26 Oct 2021|01 Nov 2021
;
Now we can use BY group processing in a data step to aggregate per student. We can set the upper and lower index for the array to be the values SAS uses to represent those days. Temporary arrays are automatically retained across observations, we just need to clear it out when we start a new student.
The SAS compiler does not expect to see a date literal as the index boundaries for an array so we can use %SYSEVALF() to convert the date literal to the integer it represents.
data want;
set have;
by student ;
array october [%sysevalf('01oct2021'd):%sysevalf('31oct2021'd)] _temporary_ ;
if first.student then call missing(of october[*]);
do date=max(date_borrowed,'01oct2021'd) to min(date_returned,'31oct2021'd);
october[date]=1;
end;
if last.student;
days = sum(0, of october[*]);
keep student days;
run;
Results:
Obs Student days
1 David 19
2 Fiona 27
3 Greg 26
4 Marcus 7
5 Priya 21
6 Sasha 29
You could also modify it slightly to not only count the number of "covered" (or unique) days, but also the total number of "book" days.
data want;
set have;
by student ;
array october [%sysevalf('01oct2021'd):%sysevalf('31oct2021'd)] _temporary_ ;
if first.student then call missing(of october[*]);
do date=max(date_borrowed,'01oct2021'd) to min(date_returned,'31oct2021'd);
october[date]=sum(october[date],1);
end;
if last.student;
unique_days = n(of october[*]);
book_days = sum(0,of october[*]);
keep student unique_days book_days;
run;
Results:
unique_ book_
Obs Student days days
1 David 19 19
2 Fiona 27 58
3 Greg 26 34
4 Marcus 7 7
5 Priya 21 22
6 Sasha 29 43

Related

Is there a way to rejig a dataframe to show a better time series dataset?

Hi I have the following df:
Variable Total Month
Year
2011 110 01
2011 111 02
2011 112 03
2011 113 04
2011 114 05
2011 115 06
....
....
2021 302 04
2021 303 05
2021 304 06
Is it possible to rejig the dataset to this:
Jan Feb Mar Apr May .... Nov Dec
Year
2011 110 111 112 113 114
2012 ...
2013 ...
2014 ...
2015 ...
....
2020
2021
** I would also like to remove the "Variable" word at the corner of the table.
My eventual goal is to do some simple data visualization using matplotlib to create line plots of the respective years (2011...2021)
Thank you in advance!
Use pivot()+reindex():
from calendar import month_abbr
df['Month']=pd.to_datetime(df['Month'],format='%m').dt.strftime('%b')
df=df.pivot(columns='Month',values='Total').rename_axis(columns=None)
df=df.reindex(columns=month_abbr[1:])
OR
via pivot()+pd.Categorical():
df['Month']=pd.to_datetime(df['Month'],format='%m').dt.strftime('%b')
df=df.pivot(columns='Month',values='Total').rename_axis(columns=None)
df.columns=pd.Categorical(df.columns,month_abbr[1:],ordered=True)
df=df.sort_index(axis=1)
Now if you print df you will get your expected output

How to calculate median monthly from date of month table?

My dataset:
Date Num_orders
Mar 21 2019 69
Mar 22 2019 82
Mar 24 2019 312
Mar 25 2019 199
Mar 26 2019 2,629
Mar 27 2019 2,819
Mar 28 2019 3,123
Mar 29 2019 3,332
Mar 30 2019 1,863
Mar 31 2019 1,097
Apr 01 2019 1,578
Apr 02 2019 2,353
Apr 03 2019 2,768
Apr 04 2019 2,648
Apr 05 2019 3,192
Apr 06 2019 2,363
Apr 07 2019 1,578
Apr 08 2019 3,090
Apr 09 2019 3,814
Apr 10 2019 3,836
...
I need to calculate the monthly median number of orders from days of the same month:
The desired results:
Month Median_monthly
Mar 2019 1,863
Apr 2019 2,768
May 2019 2,876
Jun 2019 ...
...
I tried to use function date_trunc to extract month from the dataset then group by 'month' but it didn't work out. Thanks for your help, I use Google Bigquery (#standard) environment!
Probably you tried to use PERCENTILE_CONT which can not be used with GROUP BY:
Try to use APPROX_QUANTILES(x, 100)[OFFSET(50)]. It should work with GROUP BY.
SELECT APPROX_QUANTILES](Num_orders, 100)\[OFFSET(50)\] AS median
FROM myTable
GROUP BY Month
Alternativele you can use PERCENTILE_CONT within subquery:
SELECT
DISTINCT Month, median
FROM (
SELECT
Month,
PERCENTILE_CONT(Num_orders, 0.5) OVER(PARTITION BY Month) AS median
FROM myTable
)
This would often be done using DISTINCT:
SELECT DISTINCT DATE_TRUNC(month, date),
PERCENTILE_CONT(Num_orders, 0.5) OVER (PARTITION BY DATE_TRUNC(month, date) AS median
FROM myTable;
Note: There are two percentile functions, PERCENTILE_CONT() and PERCENTILE_DISC(). They have different results when there is a "tie" in the middle of the data.

Calculation of values that rely on a date variable

I am trying to calculate the value of the last measurement taken (according to the date column) divided by the lowest value recorded (according to the measurement column) if two values in the “SUBJECT” column match and two values in the “PROCEDURE” column match. The the calculation would be produced in a new column. I am having trouble with this and I would appreciate a solution to this matter.
data Have;
input Subject Type :$12. Date &:anydtdte. Procedure :$12. Measurement;
format date yymmdd10.;
datalines;
500 Initial 15 AUG 2017 Invasive 20
500 Initial 15 AUG 2017 Surface 35
500 Followup 15 AUG 2018 Invasive 54
428 Followup 15 AUG 2018 Outer 29
765 Seventh 3 AUG 2018 Other 13
500 Followup 3 JUL 2018 Surface 98
428 Initial 3 JUL 2017 Outer 10
765 Initial 20 JUL 2019 Other 19
610 Third 20 AUG 2019 Invasive 66
610 Initial 17 Mar 2018 Invasive 17
;
*Intended output table
Subject Type Date Procedure Measurement Output
500 Initial 15 AUG 2017 Invasive 20 20/20
500 Initial 15 AUG 2017 Surface 35 35/35
500 Followup 15 AUG 2018 Invasive 54 54/20
428 Followup 15 AUG 2018 Outer 29 29/10
765 Seventh 3 AUG 2018 Other 13 13/19
500 Followup 3 JUL 2018 surface 98 98/35
428 Initial 3 JUL 2017 Outer 10 10/10
765 Initial 20 JUL 2019 Other 19 19/19
610 Third 20 AUG 2019 Invasive 66 66/17
610 Initial 17 Mar 2018 Invasive 17 17/17 ;
*Attempt;
PROC SQL;
create table want as
select a.*,
(select measurement as measurement_last_date
from have
where subject = a.subject and type = a.type
having date = max(date)) / min(a.measurement) as ratio
from have as a
group by subject, type
order by subject, type, date;
QUIT;
I think that you need use statement retain with data step.
the statement will retain your last row and you can 'll compare the last row with actual row processed.
link of some tutorial of how use statement retain.
enter link description here
SAS documentation
enter link description here

Showing data order from Monday-Sunday full week only and hide non-full week data

sorry if I'm shooting newbie questions here.
I want to create a weekly report, but for this weekly report, I want full data from Monday to Sunday
Condition:
Last 4 weeks only
Showing full week (Monday - Sunday)
Hide the result if it's not full week
If i use getdate -14, if I access the data on Wednesday, they will start counting last week from Wednesday 2 weeks ago instead of last Monday. Meanwhile, I want the report to show full week only.
Can anyone share how to do that in SQL?
Here I provide sample data:
Column name = DATE -- Column name: TOTAL_PERSON
- Fri, 1 Jun 2018 -- 10
- Sat, 2 Jun 2018 -- 4
- Sun, 3 Jun 2018 -- 12
- Mon, 4 Jun 2018 -- 15
- Tue, 5 Jun 2018 -- 10
- Wed, 6 Jun 2018 -- 3
- Thu, 7 Jun 2018 -- 1
- Fri, 8 Jun 2018 -- 13
- Sat, 9 Jun 2018 -- 9
- Sun, 10 Jun 2018 -- 23
- Mon, 11 Jun 2018 -- 5
- Tue, 12 Jun 2018 -- 3
- Wed, 13 Jun 2018 -- 1
- Thu, 14 Jun 2018 -- (TODAY)
In this case, if I am accessing on Thu 6 Jun 2018 I want to get TOTAL PERSON data from Mon, 4 Jun 2018 to Sun, 10 Jun 2018 only and not showing data from the rest since the week is not full.
Can anyone help me how to do that?
Thanks a lot!
I think you want:
where datediff(week, date, getdate()) <= 2
This counts the number of week boundaries between two dates, so it returns an entire week.
For MySQL, you can use such a select:
SELECT * FROM `myDB` WHERE `Date`
BETWEEN DATE_SUB(NOW()-INTERVAL DATE_FORMAT(CURRENT_DATE, '%w') DAY, INTERVAL 28 DAY)
AND NOW()- INTERVAL DATE_FORMAT(CURRENT_DATE, '%w') DAY
This uses the capability to transform the current day of this week into a number and substract this to get the last Sunday. from there, we select an intervall of 28 days.
(Only testet with 14 days and a very limited test-dataset, but should work)

Trying to pull the required rows from the single table with applying conditional statements on columns in sql server?

I have tried in n-number ways to solve this solution but unfortunately I got stuck in all the ways..
source table
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 12 15 16 17 18 19 20 21 22 23
1234 2013 05 06 12 15 16 17 18 19 20 21 22 23
Task: Assume that we are currently at March 2014, and we need 12 months back date ...(i.e., from Mar 2013 to Feb 2014, and the remaining values needs to be zero except year and id.)
Solution:
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 0 0 0 0 0 0 0 0 0 0
1234 2013 0 0 12 15 16 17 18 19 20 21 22 23
This needs a code solution for SQL Server 2008. I would be very happy if any body can solve this.
Note:
I got stuck to pull the column names dynamically.
You can try this:
select id, year, case when DATEDiff(month, getdate(), convert(datetime, year + '-01-01'))) < 12 then jan else 0,
DATEDiff(month, getdate(), convert(datetime, year + '-02-01'))) < 12 then fab else 0 ....