How to find a missing case using proc sql in SAS?

I would like to use proc sql in sas to identify if a case or record is missing some information. I have two datasets. One is a record of an entire data collection, that shows what forms have been collected during a visit. The second is a specification of what forms should be collected during a visit. I have tried many options including data steps and sql code using not in to no avail...
Example data is below
***** dataset crf is a listing of all forms that have been filled out at each visit ;
***** cid is an identifier for a study center ;
***** pid is an identifier for a participant ;
data crf;
input visit cid pid form ;
cards;
1 10 101 10
1 10 101 11
1 10 101 12
1 10 102 10
1 10 102 11
2 10 101 11
2 10 101 13
2 10 102 11
2 10 102 12
2 10 102 13
;
run;
***** dataset crfrule is a listing of all forms that should be filled out at each visit ;
***** so, visit 1 needs to have forms 10, 11, and 12 filled out ;
***** likewise, visit 2 needs to have forms 11 - 14 filled out ;
data crfrule;
input visit form ;
cards;
1 10
1 11
1 12
2 11
2 12
2 13
2 14
;
run;
***** We can see from the two tables that participant 101 has a complete set of records for visit 1 ;
***** However, participant 102 is missing form 12 for visit 1 ;
***** For visit 2, 101 is missing forms 12 and 14, whereas 102 is missing form 14 ;
***** I want to be able to know which forms were **NOT** filled out by each person at each visit (i.e., which forms are missing for each visit) ;
***** extracting unique cases from crf ;
proc sql;
create table visit_rec as
select distinct cid, pid, visit
from crf;
quit;
***** building the list of expected forms by visit number ;
proc sql;
create table expected as
select x.*,
y.*
from visit_rec as x right join crfrule as y
on x.visit = y.visit
order by visit, cid, pid, form;
quit;
***** so now I have a list of which forms that **SHOULD** have been filled out by each person ;
***** now, I just need to know if they were filled out or not... ;
The strategy I have been trying is to merge expected back onto the crf table with some indicator of which forms are missing for each visit.
Optimally, I would like to produce a table that would have: visit, cid, pid, missing_form
Any guidance is greatly appreciated.

EXCEPT will do what you want. I don't necessarily know that this is the most efficient solution in general (and if you're doing this in SAS, it's almost certainly not), but given what you've done so far, this does work:
proc sql;
create table want as
select cid,pid,visit,form from expected
except select cid,pid,visit,form from crf
;
quit;
Just be careful with EXCEPT - it's very picky (note that select * doesn't work, as your tables are in different orders).
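To see the EXCEPT idea run end to end outside SAS, here is a minimal sketch in Python with the standard sqlite3 module. It is not SAS code: the table and column names come from the question, while the function name missing_forms_except and the in-memory database are assumptions made for illustration. It builds the expected rows inline and subtracts the rows that were actually collected.

```python
import sqlite3

# Sample data copied from the question: (visit, cid, pid, form).
CRF = [
    (1, 10, 101, 10), (1, 10, 101, 11), (1, 10, 101, 12),
    (1, 10, 102, 10), (1, 10, 102, 11),
    (2, 10, 101, 11), (2, 10, 101, 13),
    (2, 10, 102, 11), (2, 10, 102, 12), (2, 10, 102, 13),
]
# Required forms per visit: (visit, form).
CRFRULE = [(1, 10), (1, 11), (1, 12), (2, 11), (2, 12), (2, 13), (2, 14)]


def missing_forms_except():
    """Return (visit, cid, pid, form) rows that are expected but absent."""
    con = sqlite3.connect(":memory:")  # hypothetical throwaway database
    con.execute("CREATE TABLE crf (visit INT, cid INT, pid INT, form INT)")
    con.execute("CREATE TABLE crfrule (visit INT, form INT)")
    con.executemany("INSERT INTO crf VALUES (?, ?, ?, ?)", CRF)
    con.executemany("INSERT INTO crfrule VALUES (?, ?)", CRFRULE)
    rows = con.execute(
        """
        -- expected rows: every recorded visit paired with its required forms
        SELECT v.visit, v.cid, v.pid, r.form
        FROM (SELECT DISTINCT visit, cid, pid FROM crf) AS v
        JOIN crfrule AS r ON r.visit = v.visit
        EXCEPT
        -- minus the forms that were actually filled out
        SELECT visit, cid, pid, form FROM crf
        ORDER BY 1, 3, 4
        """
    ).fetchall()
    con.close()
    return rows


for row in missing_forms_except():
    print(row)
```

Both sides of the EXCEPT list the columns in the same order, which is the point the warning about select * is making.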

I suggest a nested query, which alternatively can be done in two steps. What about this one:
proc sql;
create table temp as
select distinct c.*
, (d.visit is null and d.form is null and d.pid is null) as missing_form
from (
select distinct a.pid, b.* from
crf a, crfrule b
) c
left join crf d
on c.pid = d.pid
and c.form = d.form
and c.visit = d.visit
order by c.pid, c.visit, c.form
;
quit;
It gives you a list of all expected combinations of pid, form, and visit, with a boolean missing_form that is 1 when the form was not filled out and 0 when it was.

You could use a left join and then use the where clause to keep only the expected rows that have no matching record in the right table (crf).
proc sql;
select
e.*
from
expected e left join
crf c on
e.visit = c.visit and
e.cid = c.cid and
e.pid = c.pid and
e.form = c.form
where c.visit is missing
;
quit;
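The same anti-join pattern can be tried outside SAS. The sketch below uses Python's sqlite3 module rather than SAS, so "is missing" becomes "IS NULL". The function name missing_forms_antijoin and the expected CTE are assumptions for illustration; the data and column names are from the question.

```python
import sqlite3

# Sample data copied from the question: (visit, cid, pid, form).
CRF = [
    (1, 10, 101, 10), (1, 10, 101, 11), (1, 10, 101, 12),
    (1, 10, 102, 10), (1, 10, 102, 11),
    (2, 10, 101, 11), (2, 10, 101, 13),
    (2, 10, 102, 11), (2, 10, 102, 12), (2, 10, 102, 13),
]
# Required forms per visit: (visit, form).
CRFRULE = [(1, 10), (1, 11), (1, 12), (2, 11), (2, 12), (2, 13), (2, 14)]


def missing_forms_antijoin():
    """Left join expected rows to crf and keep the rows with no match."""
    con = sqlite3.connect(":memory:")  # hypothetical throwaway database
    con.execute("CREATE TABLE crf (visit INT, cid INT, pid INT, form INT)")
    con.execute("CREATE TABLE crfrule (visit INT, form INT)")
    con.executemany("INSERT INTO crf VALUES (?, ?, ?, ?)", CRF)
    con.executemany("INSERT INTO crfrule VALUES (?, ?)", CRFRULE)
    rows = con.execute(
        """
        WITH expected AS (
            SELECT v.visit AS visit, v.cid AS cid, v.pid AS pid, r.form AS form
            FROM (SELECT DISTINCT visit, cid, pid FROM crf) AS v
            JOIN crfrule AS r ON r.visit = v.visit
        )
        SELECT e.visit, e.cid, e.pid, e.form AS missing_form
        FROM expected AS e
        LEFT JOIN crf AS c
          ON  c.visit = e.visit
          AND c.cid   = e.cid
          AND c.pid   = e.pid
          AND c.form  = e.form
        WHERE c.visit IS NULL          -- no matching collected form
        ORDER BY e.visit, e.pid, e.form
        """
    ).fetchall()
    con.close()
    return rows


for row in missing_forms_antijoin():
    print(row)
```

The output columns are visit, cid, pid, and missing_form, which is the table layout the question asks for.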

Related

Select top n (variable) for each criteria in a table based on another table

I want VBA code that builds a query showing the Equip rows with the highest ActiveTime (column ActimeTime) for each ModelID (from the 1st table), where the number of rows to show for each ModelID is its TopN (from the 2nd table). I know I have to use QueryDef and SQL in VBA, but I can't figure out how to write the code.
Just an example to illustrate:
My 1st table is
EquipID  Equip  ActimeTime  ModelID
1        DT1    10          1
2        DT2    6           1
3        DT3    13          1
4        DT4    15          1
5        DT5    16          2
6        DT6    12          2
7        DT7    6           2
8        DT8    13          2
My 2nd Table is
ModelID  Model  TopN
1        775    3
2        789    2
So the query result should be like (Showing the Top 3 of 775 Model and the Top 2 of 789)
Equip  ActimeTime  Model
DT4    15          775
DT3    13          775
DT1    10          775
DT5    16          789
DT8    13          789
Thanks a lot in advance, I'm really stuck at this one and solving this will help me a lot in my project
[Table1][1]
[1]: https://i.stack.imgur.com/geMca.png
[Table2][2]
[2]: https://i.stack.imgur.com/lMPDP.png
[Query Result][3]
[3]: https://i.stack.imgur.com/cGf6k.png
You can do it in straight SQL, but oooh is it ugly to follow and construct.
I created 4 queries, with the final one producing what you're looking for.
The key is to get a row ID based on the sort order you want (Model, then ActimeTime descending). You can get a pseudo row ID using DCount.
Here are the 4 queries; I'm sure you can mash them into one if you're daring.
My tables are Table3 and Table4; you can change them in the first query to match your database. Build these queries in order, as each depends on the one before it.
qListModels
SELECT Table3.Equip, Table3.ActimeTime, Table4.Model, Table4.TopN, "" & [Model] & "-" & Format([ActimeTime],"000") AS [Model-ActTime]
FROM Table3 INNER JOIN Table4 ON Table3.ModelID = Table4.ModelID
ORDER BY Table4.Model, Table3.ActimeTime DESC;
qListModelsInOrder
SELECT qListModels.*, DCount("[Model-ActTime]","[qListModels]","[Model-ActTime]>=" & """" & [Model-ActTime] & """") AS row_id
FROM qListModels;
qListModelStartRows
SELECT qListModelsInOrder.Model, Min(qListModelsInOrder.row_id) AS MinOfrow_id
FROM qListModelsInOrder
GROUP BY qListModelsInOrder.Model;
qListTopNModels
SELECT qListModelsInOrder.Equip, qListModelsInOrder.ActimeTime, qListModelsInOrder.Model
FROM qListModelsInOrder INNER JOIN qListModelStartRows ON qListModelsInOrder.Model = qListModelStartRows.Model
WHERE ((([row_id]-[MinOfrow_id])<[TopN]))
ORDER BY qListModelsInOrder.Model, qListModelsInOrder.ActimeTime DESC;
This last one can be run anytime to get the results you want
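For comparison, the ranking idea can also be written as a single query with a correlated COUNT(*) in place of DCount. The sketch below is an assumption-laden illustration in Python with sqlite3, not Access or VBA: the function name top_n_per_model is made up, and Table3 / Table4 simply reuse the table names from the queries above with the sample data from the question. A row is kept when fewer than TopN rows of the same model have a strictly higher ActimeTime, so ties at the cutoff would all be kept.

```python
import sqlite3

# (EquipID, Equip, ActimeTime, ModelID), copied from the question.
EQUIP = [
    (1, "DT1", 10, 1), (2, "DT2", 6, 1), (3, "DT3", 13, 1), (4, "DT4", 15, 1),
    (5, "DT5", 16, 2), (6, "DT6", 12, 2), (7, "DT7", 6, 2), (8, "DT8", 13, 2),
]
# (ModelID, Model, TopN), copied from the question.
MODELS = [(1, "775", 3), (2, "789", 2)]


def top_n_per_model():
    """Return the TopN rows by ActimeTime for each model."""
    con = sqlite3.connect(":memory:")  # hypothetical throwaway database
    con.execute(
        "CREATE TABLE Table3 (EquipID INT, Equip TEXT, ActimeTime INT, ModelID INT)"
    )
    con.execute("CREATE TABLE Table4 (ModelID INT, Model TEXT, TopN INT)")
    con.executemany("INSERT INTO Table3 VALUES (?, ?, ?, ?)", EQUIP)
    con.executemany("INSERT INTO Table4 VALUES (?, ?, ?)", MODELS)
    rows = con.execute(
        """
        SELECT e.Equip, e.ActimeTime, m.Model
        FROM Table3 AS e
        JOIN Table4 AS m ON m.ModelID = e.ModelID
        WHERE (
            -- how many rows of the same model rank above this one
            SELECT COUNT(*)
            FROM Table3 AS e2
            WHERE e2.ModelID = e.ModelID
              AND e2.ActimeTime > e.ActimeTime
        ) < m.TopN
        ORDER BY m.Model, e.ActimeTime DESC
        """
    ).fetchall()
    con.close()
    return rows


for row in top_n_per_model():
    print(row)
```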

SAS EG (SQL) deleting rows where max value in one column

I need to delete all the rows that have the max value of duty_perd_id within each group of rows sharing the same rotn_prng_nbr and empl_nbr (the two columns are not equal to each other; I mean the max over all rows where that pair stays constant). From the table below, it should delete rows 3, 7, and 9.
rotn_prng_nbr  empl_nbr  duty_perd_id
B93            12        1
B93            12        2
B93            12        3
B21            12        1
B21            12        2
B21            12        3
B21            12        4
B21            18        1
B21            18        2
I am using SAS EG. Right now all I have is below:
Option 1:
create table middle_legs as
select t.*
from actual_flt_leg as t
where t.duty_perd_id < (select max(t2.duty_perd_id)
from actual_flt_leg as t2
where t2.rotn_prng_nbr = t.rotn_prng_nbr and
t2.empl_nbr = t.empl_nbr
);
This works exactly as intended, but is incredibly slow. The other thought that I had, but couldn't quite finish, was as follows.
Option 2:
create table last_duty_day as
Select * from actual_flt_leg
inner join (
select actual_flt_leg.Rotn_Prng_Nbr,actual_flt_leg.empl_nbr, max(duty_perd_id) as last_duty
from actual_flt_leg
group by actual_flt_leg.Rotn_Prng_Nbr, actual_flt_leg.empl_nbr
) maxtable on
actual_flt_leg.Rotn_Prng_Nbr = maxtable.Rotn_Prng_Nbr
and actual_flt_leg.empl_Nbr = maxtable.empl_Nbr
and actual_flt_leg.duty_perd_id = maxtable.last_duty;
Option 2 finds the highest duty_perd_id for each given pair, and I was wondering if there is any "reverse join" that would show only the rows from the original table that do not match this new table I created in Option 2.
If there is a way to make Option 1 faster, finish Option 2, or anything else I can't think of, I'd appreciate it. Thanks!
You are almost there. You just want <:
Select *
from actual_flt_leg inner join
(select actual_flt_leg.Rotn_Prng_Nbr,actual_flt_leg.empl_nbr, max(duty_perd_id) as last_duty
from actual_flt_leg
group by actual_flt_leg.Rotn_Prng_Nbr, actual_flt_leg.empl_nbr
) maxtable
on actual_flt_leg.Rotn_Prng_Nbr = maxtable.Rotn_Prng_Nbr and
actual_flt_leg.empl_Nbr = maxtable.empl_Nbr and
actual_flt_leg.duty_perd_id < maxtable.last_duty;
In SAS SQL, this is pretty easy:
data have;
input rotn_prng_nbr $ empl_nbr duty_perd_id;
datalines;
B93 12 1
B93 12 2
B93 12 3
B21 12 1
B21 12 2
B21 12 3
B21 12 4
B21 18 1
B21 18 2
;;;;
run;
proc sql;
select *
from have
group by rotn_prng_nbr, empl_nbr
having duty_perd_id lt max(duty_perd_id);
quit;
This isn't valid SQL in any other system that I've seen, but it works in SAS. You can group by a set of variables while still selecting all of the variables, including ones not in the group by; SAS runs two queries and merges them behind the scenes for you, and writes this note to the log:
NOTE: The query requires remerging summary statistics back with the original data.
As far as I understand, the result under the hood is identical to the more "compatible" version Gordon suggests; it is just a matter of whether you prefer typing less or writing more portable SQL.
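The portable join-to-group-max version can be checked against the sample rows with a short script. This is a sketch in Python with sqlite3, not SAS (SQLite has no remerge feature, so only the join form applies). The function name rows_below_group_max is hypothetical, and ordering by rowid is used here only to print rows in their original order.

```python
import sqlite3

# (rotn_prng_nbr, empl_nbr, duty_perd_id), copied from the question.
LEGS = [
    ("B93", 12, 1), ("B93", 12, 2), ("B93", 12, 3),
    ("B21", 12, 1), ("B21", 12, 2), ("B21", 12, 3), ("B21", 12, 4),
    ("B21", 18, 1), ("B21", 18, 2),
]


def rows_below_group_max():
    """Keep rows whose duty_perd_id is below the max of their group."""
    con = sqlite3.connect(":memory:")  # hypothetical throwaway database
    con.execute(
        "CREATE TABLE actual_flt_leg "
        "(rotn_prng_nbr TEXT, empl_nbr INT, duty_perd_id INT)"
    )
    con.executemany("INSERT INTO actual_flt_leg VALUES (?, ?, ?)", LEGS)
    rows = con.execute(
        """
        SELECT t.rotn_prng_nbr, t.empl_nbr, t.duty_perd_id
        FROM actual_flt_leg AS t
        JOIN (
            -- one row per (rotn_prng_nbr, empl_nbr) with its highest duty period
            SELECT rotn_prng_nbr, empl_nbr, MAX(duty_perd_id) AS last_duty
            FROM actual_flt_leg
            GROUP BY rotn_prng_nbr, empl_nbr
        ) AS m
          ON  m.rotn_prng_nbr = t.rotn_prng_nbr
          AND m.empl_nbr      = t.empl_nbr
          AND t.duty_perd_id  < m.last_duty
        ORDER BY t.rowid
        """
    ).fetchall()
    con.close()
    return rows


for row in rows_below_group_max():
    print(row)
```

Rows 3, 7, and 9 of the sample table drop out, leaving six rows.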

MS access join two tables, get unique rows

This is a modification of a previous question (see the original answer). I hope starting a new thread is the proper thing to do.
I have a table called Parts, PartRefID is the PK
PartRefID PartDefID AssemblyID
1 2 c63df10b-8250-4aa5-9889-9e8046331dbf
11 1 db51f4a8-3ffa-41f7-81c1-a9accbbb299a
67 6 136fc5d8-7b65-41b5-bca3-7d4180a1e0ab
77 5 38fa8b7a-2945-4546-8eab-7865a1e515b2
133 2 c63df10b-8250-4aa5-9889-9e8046331dbf
134 6 136fc5d8-7b65-41b5-bca3-7d4180a1e0ab
I need to extract rows with a unique AssemblyID. This was answered by GMB with the following sql:
select *
from parts as p
where [PartRefID] = (
select max(p1.[PartRefID])
from parts as p1
where p1.[AssemblyID] = p.[AssemblyID] and p1.[PartDefID] = 2
)
which worked beautifully. However requirements have changed and I must ignore the PartDefID field and there could also be AssemblyID's which represent parts I do not want.
The AssemblyID's shown in the above table represent an electrical connector part.
Electrical connector parts will ALWAYS have a Partclass of 1 which is defined in another table called PartDefinitions shown here:
PartDefID PartClass PartNumber
1 1 MS27467T23F55P
2 1 330-00186-09
3 2 336-00024-00
4 2 336-00022-00
5 1 MS27468T23F55S
6 1 330-00184-09
With my limited SQL knowledge I decided a join was necessary and came up with the following code:
SELECT Parts.*, PartDefinitions.PartClass
From PartDefinitions
INNER Join Parts
On PartDefinitions.PartDefID = Parts.PartDefID
Where (((PartDefinitions.PartClass) = 1))
This gets me close: it produces all the parts in the Parts table that are connectors. However, there are some duplicate AssemblyIDs.
What I need is to produce the following:
PartRefID PartDefID AssemblyID
1 2 c63df10b-8250-4aa5-9889-9e8046331dbf
11 1 db51f4a8-3ffa-41f7-81c1-a9accbbb299a
67 6 136fc5d8-7b65-41b5-bca3-7d4180a1e0ab
77 5 38fa8b7a-2945-4546-8eab-7865a1e515b2
my apologies if I have not made a clear and concise question
thanks for any help
and thanks again GMB
If I understand correctly, you want to filter by PartClass in both the subquery and the outer query:
select p.*, pd.PartClass
From Parts as p inner join
PartDefinitions as pd
on pd.PartDefID = p.PartDefID
where pd.PartClass = 1 and
p.PartRefID = (select max(p2.PartRefID)
from Parts as p2 inner join
PartDefinitions as pd2
on pd2.PartDefID = p2.PartDefID
where p2.AssemblyID = p.AssemblyID and
pd2.PartClass = 1
)
Note that your expected output keeps the lowest PartRefID for each AssemblyID (1 and 67 rather than 133 and 134); if that is what you want, use min instead of max.
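As a sketch of the same correlated-subquery idea, the script below runs it in Python with sqlite3 (not Access) on the sample rows from the question. The function name unique_connectors and its pick argument are assumptions for illustration; the column names follow the Parts and PartDefinitions tables as shown in the question. Picking MIN reproduces the desired output listed in the question, while MAX keeps the highest PartRefID per AssemblyID instead.

```python
import sqlite3

# (PartRefID, PartDefID, AssemblyID), copied from the question.
PARTS = [
    (1, 2, "c63df10b-8250-4aa5-9889-9e8046331dbf"),
    (11, 1, "db51f4a8-3ffa-41f7-81c1-a9accbbb299a"),
    (67, 6, "136fc5d8-7b65-41b5-bca3-7d4180a1e0ab"),
    (77, 5, "38fa8b7a-2945-4546-8eab-7865a1e515b2"),
    (133, 2, "c63df10b-8250-4aa5-9889-9e8046331dbf"),
    (134, 6, "136fc5d8-7b65-41b5-bca3-7d4180a1e0ab"),
]
# (PartDefID, PartClass, PartNumber), copied from the question.
PART_DEFS = [
    (1, 1, "MS27467T23F55P"), (2, 1, "330-00186-09"), (3, 2, "336-00024-00"),
    (4, 2, "336-00022-00"), (5, 1, "MS27468T23F55S"), (6, 1, "330-00184-09"),
]


def unique_connectors(pick="MIN"):
    """One PartClass = 1 row per AssemblyID; pick is 'MIN' or 'MAX'."""
    if pick not in ("MIN", "MAX"):  # whitelist before formatting into SQL
        raise ValueError("pick must be 'MIN' or 'MAX'")
    con = sqlite3.connect(":memory:")  # hypothetical throwaway database
    con.execute("CREATE TABLE Parts (PartRefID INT, PartDefID INT, AssemblyID TEXT)")
    con.execute(
        "CREATE TABLE PartDefinitions (PartDefID INT, PartClass INT, PartNumber TEXT)"
    )
    con.executemany("INSERT INTO Parts VALUES (?, ?, ?)", PARTS)
    con.executemany("INSERT INTO PartDefinitions VALUES (?, ?, ?)", PART_DEFS)
    rows = con.execute(
        f"""
        SELECT p.PartRefID, p.PartDefID, p.AssemblyID
        FROM Parts AS p
        JOIN PartDefinitions AS pd ON pd.PartDefID = p.PartDefID
        WHERE pd.PartClass = 1
          AND p.PartRefID = (
              SELECT {pick}(p2.PartRefID)
              FROM Parts AS p2
              JOIN PartDefinitions AS pd2 ON pd2.PartDefID = p2.PartDefID
              WHERE p2.AssemblyID = p.AssemblyID
                AND pd2.PartClass = 1
          )
        ORDER BY p.PartRefID
        """
    ).fetchall()
    con.close()
    return rows


for row in unique_connectors("MIN"):
    print(row)
```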

SQL select command SUM across 3 related tables

I've changed my DB structure to make it more future proof. Now I'm having trouble with the new select query.
I have a table called activities that lists activities and how many steps per minute each activity is worth. The table was structured like this:
Activities
id act_name act_steps
12 Boxing 250
14 Karate 300
17 Yoga 89
I have another table called distance that is structured like this:
Distance
id dist_activity_id dist_activity_duration member_id
1 12 60 12
2 14 90 12
3 17 30 12
I have the query that would SUM and produce a total for all activities in the distance table
SELECT ROUND(SUM(act_steps * dist_activity_duration / 2000),2) AS total_miles
FROM distance,
activities
WHERE activities.id = distance.dist_activity_id
This worked fine.
To future-proof it in case the number of steps for an activity changes, I've set up a table called steps that is structured like this:
Steps
id activity_steps
1 6
2 250
3 300
4 89
I then updated the activities table, removing the act_steps column and replacing it with steps_id so it now looks like this:
Updated activities
id act_name steps_id
12 Boxing 2
14 Karate 3
17 Yoga 4
I'm not sure how to create the select command to get the SUM using the new structure.
Could someone please help me with this?
Thanks
Wayne
Learn to use proper JOIN syntax! Your query should look like:
SELECT ROUND(SUM(a.act_steps * d.dist_activity_duration / 2000), 2) AS total_miles
FROM distance d JOIN
activities a
ON a.id = d.dist_activity_id;
If you need to look up the steps, then add another JOIN:
SELECT ROUND(SUM(s.activity_steps * d.dist_activity_duration / 2000), 2) AS total_miles
FROM distance d JOIN
activities a
ON a.id = d.dist_activity_id JOIN
steps s
ON s.id = a.steps_id;
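To check the three-table join against the sample rows, here is a minimal sketch in Python with sqlite3. It is not the asker's database engine, and the function name total_miles is an assumption. One SQLite-specific detail, flagged as such: SQLite does integer division on integer operands, so the sketch divides by 2000.0 rather than 2000.

```python
import sqlite3

STEPS = [(1, 6), (2, 250), (3, 300), (4, 89)]               # (id, activity_steps)
ACTIVITIES = [(12, "Boxing", 2), (14, "Karate", 3), (17, "Yoga", 4)]  # (id, act_name, steps_id)
DISTANCE = [(1, 12, 60, 12), (2, 14, 90, 12), (3, 17, 30, 12)]
# DISTANCE columns: (id, dist_activity_id, dist_activity_duration, member_id)


def total_miles():
    """Sum steps * duration across distance -> activities -> steps."""
    con = sqlite3.connect(":memory:")  # hypothetical throwaway database
    con.execute("CREATE TABLE steps (id INT, activity_steps INT)")
    con.execute("CREATE TABLE activities (id INT, act_name TEXT, steps_id INT)")
    con.execute(
        "CREATE TABLE distance (id INT, dist_activity_id INT, "
        "dist_activity_duration INT, member_id INT)"
    )
    con.executemany("INSERT INTO steps VALUES (?, ?)", STEPS)
    con.executemany("INSERT INTO activities VALUES (?, ?, ?)", ACTIVITIES)
    con.executemany("INSERT INTO distance VALUES (?, ?, ?, ?)", DISTANCE)
    (total,) = con.execute(
        """
        SELECT ROUND(SUM(s.activity_steps * d.dist_activity_duration / 2000.0), 2)
        FROM distance AS d
        JOIN activities AS a ON a.id = d.dist_activity_id
        JOIN steps AS s ON s.id = a.steps_id
        """
    ).fetchone()
    con.close()
    return total


print(total_miles())
```

With the sample rows the unrounded sum is (250*60 + 300*90 + 89*30) / 2000 = 22.335, so the rounded result should land within a cent of that.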

Using Proc SQL to join two datasets with 2 matching variables

I have two datasets A & B. I want to join them against two fields: ID and End of Month date. This is defined as EOMDate in dataset A and BalDate in dataset B. How do I join them so that ID and the dates match with each other?
Tom's comment works. Here are a few worked samples:
/*Create some input data for the samples...*/
data first;
input id_a id_b data $;
cards;
1 1 A
2 2 B
3 33 C
4 4 D
55 5 E
;
run;
data second;
input id_a id_b data2 $;
cards;
1 1 AA
2 2 BB
3 3 CC
4 4 DD
5 5 EE
;
run;
/*The proc sql way. We create table 'combo' as the result. */
/*You can join on more than one condition. */
proc sql noprint;
create table combo as
select * from first join second
on first.id_a=second.id_a and first.Id_b=second.id_b;
quit;
I've noticed that proc sql can be quite slow when working with large datasets.
Here is a way to do the same with a data step merge.
First you need to sort both datasets by the join keys.
/*A way to accomplish this with a data step merge.*/
proc sort data=first; by id_a id_b; run;
proc sort data=second; by id_a id_b; run;
data Combo_sas;
merge first(in=a) second(in=b);
by id_a id_b;
if a and b;
run;
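The question's key columns have different names in each dataset (EOMDate in A, BalDate in B), which the samples above do not show. The sketch below illustrates that case in Python with sqlite3 rather than SAS. The rows, the val_a and val_b columns, and the function name join_on_id_and_date are all made up for illustration; only ID, EOMDate, and BalDate come from the question.

```python
import sqlite3

# Hypothetical sample rows; dates are stored as ISO text.
A_ROWS = [(1, "2020-01-31", "a1"), (1, "2020-02-29", "a2"), (2, "2020-01-31", "a3")]
B_ROWS = [(1, "2020-01-31", 100), (1, "2020-03-31", 300), (2, "2020-01-31", 200)]


def join_on_id_and_date():
    """Inner join A and B where both ID and the month-end date match."""
    con = sqlite3.connect(":memory:")  # hypothetical throwaway database
    con.execute("CREATE TABLE A (ID INT, EOMDate TEXT, val_a TEXT)")
    con.execute("CREATE TABLE B (ID INT, BalDate TEXT, val_b INT)")
    con.executemany("INSERT INTO A VALUES (?, ?, ?)", A_ROWS)
    con.executemany("INSERT INTO B VALUES (?, ?, ?)", B_ROWS)
    rows = con.execute(
        """
        SELECT a.ID, a.EOMDate, a.val_a, b.val_b
        FROM A AS a
        JOIN B AS b
          ON  a.ID      = b.ID
          AND a.EOMDate = b.BalDate   -- differently named date columns
        ORDER BY a.ID, a.EOMDate
        """
    ).fetchall()
    con.close()
    return rows


for row in join_on_id_and_date():
    print(row)
```

In proc sql the ON clause would have the same shape: a.ID = b.ID and a.EOMDate = b.BalDate. For the data step merge, which matches on identically named BY variables, one of the date variables would need a rename= dataset option first.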