I just started learning SAS and realised that proc sql don't use window functions. As I am more at ease with sql I was wondering how I can simulate a sum window function in proc?
desired result
select a.active, a.store_id, a.nbr, sum(nbr) over (partition by a.store_id)
from(select active, store_id, count(customer_id) as nbr from customer group by active, store_id) as a
;
active
store_id
nbr
sum
0
1
8
326
1
1
318
326
0
2
7
273
1
2
266
273
eg of raw data
select active, store_id, customer_id
from customer
limit 10;
active
store_id
customer_id
1
1
1
1
1
2
1
2
3
1
2
4
1
1
5
1
1
6
0
1
7
1
2
8
1
1
9
1
2
10
current result and query
select a.active, a.store_id, a.nbr, sum(nbr)
from(select active, store_id, count(customer_id) as nbr from customer group by active, store_id) as a
group by a.active, a.store_id, a.nbr;
active
store_id
nbr
sum
0
1
8
8
1
1
318
318
0
2
7
7
1
2
266
266
Unlike some SQL implementation SAS is happy to re-merge summary statistics back onto the detail rows when you include variables that are neither group by nor summary statistics.
Let's convert your print outs of data into an actual dataset. And let's change one value so we have at least two values of ACTIVE to group by.
data have;
input active store_id customer_id;
cards;
1 1 1
1 1 2
1 2 3
1 2 4
1 1 5
1 1 6
0 1 7
1 2 8
1 1 9
1 2 10
;
Now we can count the records by ACTIVE and STORE_ID and then generate the report by appending the store total.
proc sql;
select active,store_id,nbr,sum(nbr) as store_nbr
from (select active,store_id,count(*) as nbr
from have
group by active,store_id
)
group by store_id
;
Resulting printout:
active store_id nbr store_nbr
---------------------------------------
0 1 1 6
1 1 5 6
1 2 4 4
You can do the equivalent in proc sql by merging two sub-queries: one for the count of customers by active, store_id, and another for the total customers for each store_id.
proc sql noprint;
create table want as
select t1.active
, t1.store_id
, t1.nbr
, t2.sum
from (select active
, store_id
, count(customer_id) as nbr
from have
group by store_id, active
) as t1
LEFT JOIN
(select store_id
, count(customer_id) as sum
from have
group by store_id
) as t2
ON t1.store_id = t2.store_id
;
quit;
If you wanted to do this in a more SASsy way, you can run proc means and merge together the results from a single dataset that holds everything you need. proc means will calculate all possible combinations of your variables by default.
proc means data=have noprint;
class store_id active;
ways 1 2;
output out=want_total
n(customer_id) = total
;
run;
data want;
merge want_total(where=(_TYPE_ = 3) rename=(total = nbr) )
want_total(where=(_TYPE_ = 2) rename=(total = sum) keep=_TYPE_ store_id total)
;
by store_id;
drop _:;
run;
Or, in SQL:
proc sql;
create table want as
select t1.store_id
, t1.active
, t1.total as nbr
, t2.total as sum
from want_total as t1
LEFT JOIN
want_total as t2
ON t1.store_id = t2.store_id
where t1._TYPE_ = 3
AND t2._TYPE_ = 2
;
quit;
The _TYPE_ variable identifies the level of the analysis. For example, _TYPE_ = 1 is for active only, _TYPE_ = 2 is for store_id only, and _TYPE_ = 3 is for all combinations. You can view this in the output dataset from proc means:
store_id active _TYPE_ _FREQ_ total
. 0 1 3 3
. 1 1 7 7
1 . 2 6 6
2 . 2 4 4
1 0 3 1 1
1 1 3 5 5
2 0 3 2 2
2 1 3 2 2
And if you wanted faster high-performance results, check out its big sibling, proc hpsummary.
Therein lies the cool thing about SAS: You can bounce between PROCs, SQL, the DATA Step, and Python via Pandas/proc python. You can exploit the unique benefits of each of these methods and thought processes for any number of data engineering and statistics problems.
Related
I seek to find the maximum timestamp (ob.create_ts) for each group of marketid's (ob.marketid), joining tables obe (ob.orderbookid = obe.orderbookid) and market (ob.marketid = m.marketid). Although there are a number of solutions posted like this for a single table, when I join multiple tables, I get redundant results. Sample table and desired results below:
table: ob
orderbookid
marketid
create_ts
1
1
1664635255298
2
1
1664635255299
3
1
1664635255300
4
2
1664635255301
5
2
1664635255302
6
2
1664635255303
table: obe
orderbookentryid
orderbookid
entryname
1
1
'entry-1'
2
1
'entry-2'
3
1
'entry-3'
4
2
'entry-4'
5
2
'entry-5'
6
3
'entry-6'
7
3
'entry-7'
8
4
'entry-8'
9
5
'entry-9'
10
6
'entry-10'
table: m
marketid
marketname
1
'market-1'
2
'market-2'
desired results
ob.orderbookid
ob.marketid
obe.orderbookentryid
obe.entryname
m.marketname
3
1
6
'entry-6'
'market-1'
3
1
7
'entry-7'
'market-1'
6
2
10
'entry-10'
'market-2'
Use ROW_NUMBER() to get a properly filtered ob table. Then JOIN the other tables onto that!
WITH
ob_filtered AS (
SELECT
orderbookid,
marketid
FROM
(
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY
marketid
ORDER BY
create_ts DESC
) AS create_ts_rownumber
FROM
ob
) ob_with_rownumber
WHERE
create_ts_rownumber = 1
)
SELECT
ob_filtered.orderbookid,
ob_filtered.marketid,
obe.orderbookentryid,
obe.entryname,
m.marketname
FROM
ob_filtered
JOIN m
ON m.marketid = ob_filtered.marketid
JOIN obe
ON ob_filtered.orderbookid = obe.orderbookid
;
I have following tables.
Part
id
name
1
Part 1
2
Part 2
3
Part 3
Operation
id
name
part_id
order
1
Op 1
1
10
2
Op 2
1
20
3
Op 3
1
30
4
Op 1
2
10
5
Op 2
2
20
6
Op 1
3
10
Lot
id
part_id
Operation_id
10
1
2
11
2
5
12
3
6
I am selecting the results from Lot table and I want to select a column last_Op which is based on the order value of the operation_id. If value of order for the operation_id is the highest for the respective part_id, return 1 else return 0
SELECT
id,
part_id,
operation_id,
last_Op
FROM Lot
expected result set based on the tables above.
id
part_id
operation_id
last_op
10
1
2
0
11
2
5
1
12
3
6
1
In above example, first row returns last_op = 0 because operation_id = 2 is associated with part_id = 1 and it has the highest order = 30. Since operation_id for this part is not pointing towards the highest order value, 0 is returned.
The other two rows return 1 because operation_id 5 and 6 are associated with part_id 2 and 3 respectively and they are pointing towards the highest 'order' value.
If value of order for the operation_id is the highest for the respective part_id, return 1 else return 0
This sounds like window functions will help:
select l.*,
(case when o.order = o.max_order then 1 else 0 end) as last_op
from lot l left join
(select o.*,
max(o.order) over (partition by o.part_id) as max_order
from operations o
) o
on l.operation_id = o.id;
Note: order is a very poor name for a column because it is a SQL keyword.
I have a form which have stars for rating from 1-5
i have count each user hits over individual value
Now i need to have count() of my table along with my result
i have achieved this with sub-query but for me cost of query matters that's why i am tired
to calculate count() along with my existing query.
How to add a column with existing record which calculate count of all records in table.?
I will attach my output
WITH survey AS (
SELECT
*
FROM
(
SELECT
student_id,
performance,
teacher_behaviour,
syllabus,
survey_id
FROM
survey_feedback
WHERE
survey_id = 1
GROUP BY
student_id,
performance,
teacher_behaviour,
syllabus,
survey_id
) UNPIVOT ( star
FOR q
IN ( performance AS 'PERFORMANCE',
teacher_behaviour AS 'TEACHER_BEHAVIOUR',
syllabus AS 'SYLLABUS' ) )
ORDER BY
1,
2
)
SELECT
*
FROM
survey PIVOT (
COUNT ( student_id )
FOR star
IN ( 1, 2, 3, 4, 5 )
)
ORDER BY
q;
Output is
1 PERFORMANCE 0 1 2 2 1
1 SYLLABUS 2 2 2 0 0
1 TEACHER_BEHAVIOUR 2 0 1 1 2
Desired Output is
1 PERFORMANCE 0 1 2 2 1 6 <-Want this total count of table
1 SYLLABUS 2 2 2 0 0 6
1 TEACHER_BEHAVIOUR 2 0 1 1 2 6
I have two tables:
Report
ReportId CreatedDate
1 2018-01-12
2 2018-02-12
3 2018-03-12
ReportSpecialty
SpecialtyId ReportId IsPrimarySpecialty
1 1 1
2 2 1
3 3 1
1 2 0
1 3 0
I am trying to write a query that will retrieve me the last 10 reports that were published. However, I need to get 1 report from each specialty. Assume there are 100 specialties, I can pass in as an argument any number of specialties, 10, 20, 5, 2, etc...
I'm trying to figure out a way where if I send it all specialties, it will get me the last 10 reports posted based on the last date created, but it won't give me 2 articles from same specialty. If I send it 10 specialties, then I will get 1 of each. If I send it 5, then I'll get 2 of each. If I send it 3 then I'll get 4 of 1 and 3 of other two.
I may need to write multiple queries for this, I'm trying to see if there is a way to do this on the SQL side of things? If there isn't, then how would I break down to multiple queries to get the result I want?
What I have tried is this, however I get multiple reports with same specialties:
SELECT TOP 10 r.ReportId, rs.SpecialtyId, r.CreatedDate
FROM Report r
INNER JOIN ReportSpecialty rs ON r.ReportId = rs.ReportId AND rs.IsPrimarySpecialty = 1
GROUP BY rs.SpecialtyId, r.AceReportid, r.CreatedDate
ORDER BY r.CreatedDate DESC
with cte as (
SELECT R.ReportId, R.CreatedDate, RS.SpecialtyId,
ROW_NUMBER() OVER (PARTITION BY RS.SpecialtyId
ORDER BY R.CreatedDate DESC) as rn
FROM Report R
JOIN ReportSpecialty RS
ON R.ReportId = RS.ReportId
AND RS.IsPrimarySpecialty = 1
WHERE RS.SpecialtyId IN ( .... ids ... )
)
SELECT TOP 10 *
FROM cte
ORDER BY rn, CreatedDate DESC
row_number will create a id for each speciality, so if you pass 3 speciality you will get something like this.
rn speciality_id
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
Suppose I sell services that span a time interval (days, months or even years). I have a Products table, where each product is listed, together with the Customer_ID and Service_start and Service_end date.
Now I want to list all combinations of pairs (Service_start, Service_end) inside each customer; e.g. (table sorted by Customer_ID)
Lp Service_start Service_end Customer_ID
--------------------------------------------
1 2-Feb-2014 8-Aug-2014 1
2 5-May-2014 20-Dec-2014 1
3 7-Jul-2014 9-Sep-2014 1
4 13-Jan-2014 13-Jan-2015 2
.. ... ... ...
I want to turn into
Lp Service_start Service_end Customer_ID
--------------------------------------------
1 2-Feb-2014 8-Aug-2014 1
2 2-Feb-2014 20-Dec-2014 1
3 2-Feb-2014 9-Sep-2014 1
4 5-May-2014 8-Aug-2014 1
5 5-May-2014 20-Dec-2014 1
6 5-May-2014 9-Sep-2014 1
7 13-Jan-2014 8-Aug-2014 1
8 13-Jan-2014 20-Dec-2014 1
9 13-Jan-2014 9-Sep-2014 1
10 13-Jan-2014 13-Jan-2015 2
... ... ... ...
The table is big enough that it doesn't fit into memory.
How it can be achievable by SQL? Or SAS?
You can do this in SAS and SQL. Here is the SQL idea:
select ss.service_start, se.service_end, ss.customer_id
from (select distinct customer_id, service_start from table) ss join
(select distinct customer_id service_end from table) se
on ss.customer_id = se.customer_id;
This is compatible with SAS proc sql.
In most dialects of SQL, you can add the lp column using row_number() over (order by customer_id, service_start, service_end). In SAS, you can use monotonic() or a data step after proc sql.