I am very new to SAS as well as SQL and would appreciate help.
I have a data set containing student id, term and audit_type. Each term has 2 audit_types and the student can be present at either of them or be present at both.
I need to create a flag for each of these 3 scenarios for each student id each term: 1) if the student is present at only audit_type_1, 2) if s/he is present at only audit_type_2 and 3) if s/he is present at audit_type_1 and audit_type_2 both during that term. Not sure how to post my data but here it is
Sample data
| Id | Term | Audit_type |
|---- |------------- |-----------: |
| 1 | Fall 2016 | 1 |
| 1 | Fall 2016 | 2 |
| 2 | Winter 2017 | 1 |
| 3 | Winter 2017 | 2 |
| 4 | Spring 2017 | 1 |
| 4 | Spring 2017 | 2 |
I was able to create a flag for the first 2 scenarios using case when as seen below:
proc sql;
create table test as
select id, term, audit_type,
case
when audit_type in ('audit_type_1') then 1
when audit_type in ('audit_type_2 ') then 2
end as audit_type_flag
from have;
I can't figure out how to flag the third scenario. All help will be highly appreciated. Thanks in advance for your help and support. So, I want something like below:
| Id | Term | Audit_type | Flag |
|---- |------------- |-----------: |------ |
| 1 | Fall 2016 | 1 | 3 |
| 1 | Fall 2016 | 2 | 3 |
| 2 | Winter 2017 | 1 | 1 |
| 3 | Winter 2017 | 2 | 2 |
| 4 | Spring 2017 | 1 | 3 |
| 4 | Spring 2017 | 2 | 3 |
I am reading between the lines here and assuming the logic is: Audit_Type = 1 gives Flag = 1, Audit_Type = 2 gives Flag = 2, and both gives Flag = 3, so I simply added the flags together to get 3. This may not be what you are after (you may want flags 4, 5, 6 and 7 as well); it is an assumption based on a small amount of data and without knowing the exact use case. I will therefore provide a solution, and if it is correct please let me know and I will add comments explaining the syntax. I don't want to spend time explaining the code if it isn't precisely what you are looking for.
Regards,
Scott
UPDATE
I have added the comments to the code as well as links to pages which may help you better understand what I am talking about.
/* SETUP SOME DUMMY DATA */
DATA HAVE;
LENGTH ID 3. TERM $11. AUDIT_TYPE 3.;
INFILE DATALINES DSD DELIMITER = "," missover;
INPUT ID TERM AUDIT_TYPE;
DATALINES;
1,Fall 2016,1
1,Fall 2016,2
1,Summer 2016,1
1,Summer 2016,2
2,Winter 2017,1
3,Winter 2017,2
4,Spring 2017,1
4,Spring 2017,2
;
RUN;
/* PERFORM A SORT SO THAT WE CAN MAKE USE OF BY STATEMENT PROCESSING IN THE
SUBSEQUENT DATA STEP */
PROC SORT DATA = HAVE;
BY ID TERM;
RUN;
DATA WANT;
/* DOW LOOP */
/* THIS LOOP EXECUTES FOR EACH ROW UNTIL THE LAST.TERM IS ENCOUNTERED. */
/* COMMENTING OUT THE SECOND DOW LOOP WILL SHOW YOU THAT THIS LOOP IS
BASICALLY SUMMARISING THE RESULT OF EACH ID, TERM GROUP AFTER THE CONDITIONAL
LOGIC IS APPLIED TO THE FLAG VARIABLE.*/
DO UNTIL (LAST.TERM);
/* EACH TIME THE DO LOOP EXECUTES A NEW ROW IS READ INTO THE PDV (PROGRAM DATA VECTOR).*/
SET HAVE;
/* THE BY STATEMENT IS IN PLACE TO FACILITATE BY STATEMENT PROCESSING. */
BY ID TERM;
/* INITIALISE THE FLAG VARIABLE EACH TIME A NEW TERM IS ENCOUNTERED */
IF FIRST.TERM THEN FLAG = 0;
/* USING THE SYNTAX FLAG + 1 REPLICATES USING THE RETAIN STATEMENT WITH A SUM FUNCTION.
IT IS REFERRED TO AS THE SUM STATEMENT. IF YOU ARE INTERESTED IN LEARNING MORE ABOUT
THIS THEN SEE THE LINK TO THE DOCUMENTATION BELOW.*/
IF AUDIT_TYPE = 1 THEN FLAG + 1;
ELSE IF AUDIT_TYPE = 2 THEN FLAG + 2;
END;
/* ONCE THE PREVIOUS DOW LOOP EXITS (BECAUSE THE LAST.TERM HAS BEEN REACHED) THE SECOND
DOW LOOP EXECUTES*/
DO UNTIL (LAST.TERM);
/* AS PER DOW LOOP 1, EACH LOOP RESULTS IN A SINGLE OBSERVATION BEING READ */
SET HAVE;
/* THE BY STATEMENT IS IN PLACE TO FACILITATE BY STATEMENT PROCESSING. */
BY ID TERM;
/* THE EXPLICIT OUTPUT STATEMENT EXECUTES TO OUTPUT THE VALUES CONTAINED WITHIN THE
PDV */
OUTPUT;
END;
RUN;
Further reading:
The DOW-loop: a Smarter Approach to your Existing Code
The Sum Statement
The Power of the BY Statement
Just add an ELSE branch to the CASE expression:
select id, term, audit_type,
case
when audit_type in ('audit_type_1') then 1
when audit_type in ('audit_type_2 ') then 2
else 3
end as audit_type_flag
from have;
I think you need aggregation:
proc sql;
create table test as
select id, term, audit_type,
(case when min(audit_type) <> max(audit_type) then 3
when min(audit_type) = 1 then 1
when min(audit_type) = 2 then 2
end) as audit_type_flag
from have
group by id, term;
quit;
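The idea shared by the DOW-loop and aggregation answers (summarise each id/term group first, then write that result back onto every row of the group) can be checked outside SAS. Here is a minimal Python sketch, assuming audit_type only ever takes the values 1 and 2; `add_flags` is a hypothetical helper name, not something from the original posts.

```python
from collections import defaultdict

# Sample rows from the question: (id, term, audit_type)
rows = [
    (1, "Fall 2016", 1),
    (1, "Fall 2016", 2),
    (2, "Winter 2017", 1),
    (3, "Winter 2017", 2),
    (4, "Spring 2017", 1),
    (4, "Spring 2017", 2),
]


def add_flags(rows):
    """Return (id, term, audit_type, flag) tuples, where flag is 1, 2, or 3 (both)."""
    # First pass: collect the distinct audit types seen for each (id, term) group.
    seen = defaultdict(set)
    for sid, term, audit_type in rows:
        seen[(sid, term)].add(audit_type)
    # Second pass: attach the group's flag to every original row.
    # Assumption: types are only 1 and 2, so the sum of the set is 1, 2, or 3.
    return [(sid, term, a, sum(seen[(sid, term)])) for sid, term, a in rows]


flagged = add_flags(rows)
for row in flagged:
    print(row)
```

Using a set per group means a duplicated row for the same audit type would not inflate the flag, which a plain running sum would do.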
Related
I am having trouble figuring this out... I am trying to write an SQL query to subtract 2 values from this table:
RateTable:
RecordID | Policy | Benefit | CBBR | IBBR
---------+--------+---------+-------+-------
1 | 12345 | A | $1.34 | $5.64
2 | 12345 | B | $4.56 | $0.56
3 | 12345 | C | $5.67 | $3.32
4 | 54321 | A | $2.57 | $6.24
5 | 34512 | A | $1.76 | $3.32
6 | 34512 | A | $4.56 | $1.34
I need to create a query that will return the result from the value in CBBR where Policy = 12345 and benefit = A then subtract the value in IBBR where Policy = 12345 and benefit = B ($1.34 - 0.56)
Any ideas?
It is not clear to me whether you want the difference only for the single benefit 'B' row, or whether the pattern should continue, taking the IBBR of the next row whenever Policy = 12345 and Benefit = 'A'.
Assuming you want the difference against the next row in every such case:
select *,
case when Benefit = 'A' and Policy = 12345 then CBBR - new_IBBR else CBBR end as diff
from (
select *, lead(IBBR, 1) over (order by RecordID asc) as new_IBBR
from <TABLE NAME>
) x
Ankit Jindal has already given a correct answer. However, I'd like to note that there is no need for a subquery here, as subqueries can slow the query down. In this particular case, a JOIN is enough:
SELECT
rta.CBBR-rtb.IBBR as Result
FROM RateTable as rta
JOIN RateTable as rtb ON rtb.Policy=rta.Policy AND rtb.Benefit='B'
WHERE rta.Policy=12345
AND rta.Benefit='A'
As per the details mentioned in query, you need difference between CBBR & IBBR for particular conditions.
SELECT CBBR-
(SELECT IBBR
FROM RateTable
WHERE Policy = 12345
AND benefit = 'B') AS IBBR
FROM RateTable
WHERE Policy = 12345
AND benefit = 'A';
But if you need a generalized query, we would probably have to use SUM or something else.
Your question needs more explanation. Based on my understanding, I think you should use a CASE expression, such as:
Select
Case when (Policy = 12345 and benefit = 'A') then CBBR
When (Policy = 12345 and benefit = 'B') then CBBR-IBBR
END as Value
From Yourtable
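To try the self-join approach from the answers above without a database server, here is a minimal sketch using Python's built-in sqlite3 module. It assumes CBBR and IBBR are stored as plain numbers rather than as text with a leading "$", which the question does not confirm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RateTable ("
    "RecordID INTEGER, Policy INTEGER, Benefit TEXT, CBBR REAL, IBBR REAL)"
)
# Sample rows from the question, with the currency sign dropped (assumption).
conn.executemany(
    "INSERT INTO RateTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, 12345, "A", 1.34, 5.64),
        (2, 12345, "B", 4.56, 0.56),
        (3, 12345, "C", 5.67, 3.32),
        (4, 54321, "A", 2.57, 6.24),
        (5, 34512, "A", 1.76, 3.32),
        (6, 34512, "A", 4.56, 1.34),
    ],
)

# Join the 'A' row to the 'B' row of the same policy and subtract.
# Every column in the WHERE clause is qualified, because both aliases
# expose columns named Policy and Benefit.
result = conn.execute(
    """
    SELECT rta.CBBR - rtb.IBBR AS Result
    FROM RateTable AS rta
    JOIN RateTable AS rtb
      ON rtb.Policy = rta.Policy AND rtb.Benefit = 'B'
    WHERE rta.Policy = 12345 AND rta.Benefit = 'A'
    """
).fetchall()
print(result)
```

The query returns one row for this sample because policy 12345 has exactly one 'A' row and one 'B' row; with several of either, the join would return one row per pairing.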
I have event data that looks like this:
id | instance_id | value
1 | 1 | a
2 | 1 | ap
3 | 1 | app
4 | 1 | appl
5 | 2 | b
6 | 2 | bo
7 | 1 | apple
8 | 2 | boa
9 | 2 | boat
10 | 2 | boa
11 | 1 | appl
12 | 1 | apply
Basically, each row is a user typing a new letter. They can also delete letters.
I'd like to create a dataset that looks like this, let's call it data
id | instance_id | value
7 | 1 | apple
9 | 2 | boat
12 | 1 | apply
My goal is to extract all the complete words in each instance, accounting for deletion as well - so it's not sufficient to just get the longest word or the most recently typed.
To do so, I was planning to do a regex operation like so:
select * from data
where not exists (select * from data d2 where d2.value ~ (d.value || '.'))
Effectively I'm trying to build a dynamic regex that matches one character more than is present, and is specific to the row it's matching against.
The code above doesn't seem to work. In Python, I can "compile" a regex pattern before I use it. What is the equivalent in PostgreSQL to dynamically build a pattern?
Try simple LIKE operator instead of regex patterns:
SELECT * FROM data d1
WHERE NOT EXISTS (
SELECT * FROM data d2
WHERE d2.value LIKE d1.value ||'_%'
)
Demo: https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=cd064c92565639576ff456dbe0cd5f39
Create an index on the value column; this should speed up the query a bit.
Window functions are a good choice for finding peaks in sequential data. You just need to compare each value with the previous and next ones using the lag() and lead() functions:
with cte as (
select
*,
length(value) > coalesce(length(lead(value) over (partition by instance_id order by id)),0) and
length(value) > coalesce(length(lag(value) over (partition by instance_id order by id)),length(value)) as is_peak
from data)
select * from cte where is_peak order by id;
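The peak test in the window-function answer can be traced by hand on the sample data. The following Python sketch mirrors the same comparison (longer than the previous value and longer than the next one, within each instance_id); `find_peaks` is a hypothetical name used only for illustration.

```python
from collections import defaultdict

# Sample events from the question: (id, instance_id, value)
events = [
    (1, 1, "a"), (2, 1, "ap"), (3, 1, "app"), (4, 1, "appl"),
    (5, 2, "b"), (6, 2, "bo"), (7, 1, "apple"), (8, 2, "boa"),
    (9, 2, "boat"), (10, 2, "boa"), (11, 1, "appl"), (12, 1, "apply"),
]


def find_peaks(events):
    """Return rows whose value is longer than both neighbours in the same instance."""
    by_instance = defaultdict(list)
    for event in sorted(events):  # order by id, like ORDER BY id in the window
        by_instance[event[1]].append(event)

    peaks = []
    for seq in by_instance.values():
        for i, (eid, inst, value) in enumerate(seq):
            # Same defaults as the COALESCE calls in the SQL: a missing previous
            # row counts as "same length" (so the first row is never a peak) and
            # a missing next row counts as length 0.
            prev_len = len(seq[i - 1][2]) if i > 0 else len(value)
            next_len = len(seq[i + 1][2]) if i + 1 < len(seq) else 0
            if len(value) > prev_len and len(value) > next_len:
                peaks.append((eid, inst, value))
    return sorted(peaks)


peaks = find_peaks(events)
print(peaks)
```

Because a missing previous row is treated as equal in length, an instance consisting of a single keystroke would never be reported; whether that matters depends on the real data.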
So this is where I realize the difference between theory and practice: while I can picture how the result should look, I can't for the life of me figure out how to actually do it. I have tens of thousands of observations that look like this:
+--------+-------------------------------+
| ID     | CALLS                         |
+--------+-------------------------------+
| 162743 | BAD DVR-3|NO PIC-1            |
| 64747  | NO PIC-1|BOX HIT-4|PPV DROP-1 |
+--------+-------------------------------+
And the end results should be something like this:
+--------+---------+--------+---------+----------+
| ID     | BAD DVR | NO PIC | BOX HIT | PPV DROP |
+--------+---------+--------+---------+----------+
| 162743 | 3       | 1      | 0       | 0        |
| 64747  | 0       | 1      | 4       | 1        |
+--------+---------+--------+---------+----------+
I'm using PL/SQL pass-through in SAS, so if I need to transpose I can always use PROC TRANSPOSE. But getting to that point is quite honestly beyond me. I know I will probably have to create a function like the one in the question asked here: T-SQL: Opposite to string concatenation - how to split string into multiple records
Any ideas?
Do you have any reference material that describes all the possible values for those PIPE delimited values in the CALLS column? Or do you already know the particular values you need to keep and can ignore others?
If so, you can just process the entire thing in a data step; here is an example:
data have;
input @1 ID 6. @9 CALLS $50.; /* @ is the column pointer; CALLS starts in column 9 */
datalines;
162743  BAD DVR-3|NO PIC-1
64747   NO PIC-1|BOX HIT-4|PPV DROP-1
;
run;
data want;
set have; /* point to your Oracle source here */
length field $50;
idx = 1;
BAD_DVR = 0;
NO_PIC = 0;
BOX_HIT = 0;
PPV_DROP = 0;
do i=1 to 5 while(idx ne 0);
field = scan(calls,idx,'|');
if field = ' ' then idx=0;
else do;
if field =: 'BAD DVR' then BAD_DVR = input(substr(field,9),8.);
else if field =: 'NO PIC' then NO_PIC = input(substr(field,8),8.);
else if field =: 'BOX HIT' then BOX_HIT = input(substr(field,9),8.);
else if field =: 'PPV DROP' then PPV_DROP = input(substr(field,10),8.);
idx + 1;
end;
end;
output;
keep ID BAD_DVR NO_PIC BOX_HIT PPV_DROP;
run;
The SCAN function steps through the CALLS column token by token; the "=:" operator means "begins with"; and the SUBSTR function with only two parameters returns the characters following the hyphen, which are then read by the INPUT function.
Of course, I'm making a few assumptions about your source data but you get the idea.
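The same token-by-token parsing is easy to prototype before committing to a DATA step. Here is a minimal Python sketch, assuming the four category names are known in advance and that every token has the form NAME-count; `split_calls` and `CATEGORIES` are hypothetical names.

```python
# Categories we want as columns (assumption: this list is known up front).
CATEGORIES = ["BAD DVR", "NO PIC", "BOX HIT", "PPV DROP"]


def split_calls(calls):
    """Turn 'BAD DVR-3|NO PIC-1' into a dict of counts, defaulting to 0."""
    counts = {name: 0 for name in CATEGORIES}
    for token in calls.split("|"):
        token = token.strip()
        if not token:
            continue
        # Split on the LAST hyphen so a name containing '-' would still work.
        name, _, count = token.rpartition("-")
        if name in counts:  # unknown categories are ignored
            counts[name] = int(count)
    return counts


rows = [
    (162743, "BAD DVR-3|NO PIC-1"),
    (64747, "NO PIC-1|BOX HIT-4|PPV DROP-1"),
]
wide = {row_id: split_calls(calls) for row_id, calls in rows}
for row_id, counts in wide.items():
    print(row_id, counts)
```

Categories that are not in the list are silently dropped here, which matches the "keep the values you need and ignore the others" approach described above.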
I can think of at least two ways to achieve this:
1. Read the entire data from SQL into SAS. Then use DATA STEP to manipulate the data i.e.,
convert data that is in two columns:
+--------+-------------------------------+
| ID     | CALLS                         |
+--------+-------------------------------+
| 162743 | BAD DVR-3|NO PIC-1            |
| 64747  | NO PIC-1|BOX HIT-4|PPV DROP-1 |
+--------+-------------------------------+
to something that looks like this:
result of DATA STEP manipulation:
ID CALLS COUNT
162743 BAD_DVR 3
162743 NO_PIC 1
64747 NO_PIC 1
64747 BOX_HIT 4
64747 PPV_DROP 1
From then it would be a simple matter of passing the above dataset to PROC TRANSPOSE
to get a table like this:
+--------+---------+--------+---------+----------+
| ID     | BAD DVR | NO PIC | BOX HIT | PPV DROP |
+--------+---------+--------+---------+----------+
| 162743 | 3       | 1      | 0       | 0        |
| 64747  | 0       | 1      | 4       | 1        |
+--------+---------+--------+---------+----------+
2. If you want to do everything in pass-through SQL, that too should be easy IF the number of categories such as {BAD DVR, NO PIC, BOX HIT etc...} is small.
The code will look like:
SELECT
ID
,CASE WHEN SOME_FUNC_TO_FIND_LOCATION_OF_SUBSTRING(CALLS, 'BAD DVR-')>0 THEN <SOME FUNCTION TO EXTRACT EVERYTHING FROM - TO |> ELSE 0 END AS BAD_DVR__COUNT
,CASE WHEN SOME_FUNC_TO_FIND_LOCATION_OF_SUBSTRING(CALLS, 'NO PIC-')>0 THEN <SOME FUNCTION TO EXTRACT EVERYTHING FROM - TO |> ELSE 0 END AS NO_PIC__COUNT
,<and so on>
FROM YOUR_TABLE
You just need to look up the string manipulation functions available in your database to make everything work.
I have a table that is a list of paths between points. I want to create a query that returns a list of pointIDs with their range (number of points) from a given point. I have spent a day trying to figure this out and haven't got anywhere. Does anyone know how this should be done? (I am writing this for MS-SQL 2005.)
example
-- fromPointID | toPointID |
---------------|-----------|
-- 1 | 2 |
-- 2 | 1 |
-- 1 | 3 |
-- 3 | 1 |
-- 2 | 3 |
-- 3 | 2 |
-- 4 | 2 |
-- 2 | 4 |
with PointRanges ([fromPointID], [toPointID], [Range])
AS
(
-- anchor member
SELECT [fromPointID],
[toPointID],
0 AS [Range]
FROM dbo.[Paths]
WHERE [toPointID] = 1
UNION ALL
-- recursive members
SELECT P.[fromPointID],
P.[toPointID],
[Range] + 1 AS [Range]
FROM dbo.[Paths] AS P
INNER JOIN PointRanges AS PR ON PR.[toPointID] = P.[fromPointID]
WHERE [Range] < 5 -- This part is just added to limit the size of the table being returned
--WHERE P.[fromPointID] NOT IN (SELECT [toPointID] FROM PointRanges)
--Can't use the WHERE clause I want because recursion isn't allowed in the subquery
)
SELECT * FROM PointRanges
--Want this returned
-- PointID | Range |
-----------|-------|
-- 1 | 0 |
-- 2 | 1 |
-- 3 | 1 |
-- 4 | 2 |
Markus Jarderot's link gives a good answer for this. I ended up trying his answer and also tried rewriting my problem in C# and LINQ, but it turned out to be more of a mathematical problem than a coding problem because I had a table with several thousand points that interlinked. This is still something I am interested in, and I am trying to get a better understanding of it by reading books on mathematics and graph theory, but if anyone else runs into this problem I think Markus Jarderot's link is the best answer you will find.
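For readers who want to see the expected result computed outside SQL: the "range" in the question is the shortest number of hops from the start point, which a breadth-first search produces directly. The sketch below is not the method from the linked answer; it is a plain BFS in Python over the sample paths, and `point_ranges` is a hypothetical name.

```python
from collections import deque

# Sample paths from the question: (fromPointID, toPointID)
paths = [(1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2), (4, 2), (2, 4)]


def point_ranges(paths, start):
    """Return {pointID: number of hops from start} using breadth-first search."""
    neighbours = {}
    for a, b in paths:
        neighbours.setdefault(a, []).append(b)

    ranges = {start: 0}
    queue = deque([start])
    while queue:
        point = queue.popleft()
        for nxt in neighbours.get(point, []):
            # Visit each point once; the first visit is via a shortest route.
            if nxt not in ranges:
                ranges[nxt] = ranges[point] + 1
                queue.append(nxt)
    return ranges


ranges = point_ranges(paths, 1)
print(ranges)
```

The "visit each point once" check is exactly the condition the commented-out WHERE clause in the recursive CTE was trying to express.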
I am doing some work on an inbound call demand capture system where each call could have one or more demands linked to it.
There is a CaptureHeader table with CallDate, CallReference and CaptureID and a CaptureDemand table with CaptureID and DemandID.
EDIT:
I have added some representative data to show what would be expected in each table.
CaptureHeader
CaptureID | CallReference | CallDate
-----------------------------------------------
1 | 1 | 2009-11-02 20:37:00
2 | 3 | 2009-11-02 20:37:05
3 | 2 | 2009-11-02 20:37:10
4 | 4 | 2009-11-02 20:38:00
5 | 5 | 2009-11-02 20:38:30
CaptureDemand
DemandID | CaptureID | DemandText
------------------------------------
1 | 1 | Fund value
2 | 2 | Password reset
3 | 2 | Fund value
4 | 3 | Change address
5 | 3 | Fund value
6 | 3 | Rate change
7 | 3 | Fund value
8 | 4 | Variable to fixed
9 | 4 | Change address
10 | 5 | Fund value
11 | 5 | Address change
Using the tables above a filter on 'Fund value' would bring back call references of 1, 2, 3, 3, 5 because 3 has two fund values.
If I did a DISTINCT on this, then because I have ordered by date I would also have to select the date, which would still give me two lines for 3.
To get the full set of data I would do the following query:
SELECT * FROM CaptureHeader AS ch
JOIN CaptureDemand AS cd ON ch.CaptureID = cd.CaptureID
JOIN DemandDetails AS dd ON cd.DemandID = dd.DemandID
What I would like though is to get the last 100 headers by date for a particular demand. Where it gets tricky is when there is more than one of the same demand on a header for a particular reference which is possible.
I would like 100 unique call references because I then need to get back all the demands for those call references and then count how many of each other demand was also recorded in the same call.
EDIT:
I would like to be able to say 'WHERE DemandID = SomeValue' to select my 100 references.
In other words out of 100 "value requested" demands what else was asked for. If this doesn't make sense let me know and I will try and modify the question to be clearer.
I would like to get a table like this:
Demands | Count
------------------------
Demand asked for | 100
Another demand | 36
Third demand | 12
Fourth demand | 6
Cheers, Ian.
Now that the sample data has made your requirement more explicit, I believe the following will generally serve your needs. It is essentially the same as my previous submission, with an added condition on the JOIN; this condition excludes any CaptureDemand row for which we already have the same DemandText (within the same Capture), retaining only the one with the lowest DemandId.
WITH myCTE (CaptId, NbOfDemands)
AS (
SELECT CaptureID, COUNT(*) -- Can use COUNT(DISTINCT DemandText)
FROM CaptureDemand
WHERE CaptureID IN
(SELECT TOP 100 C.CaptureID
FROM CaptureHeader C
JOIN CaptureDemand D ON C.CaptureID = D.CaptureID
AND NOT EXISTS (
SELECT * FROM CaptureDemand X
WHERE X.CaptureId = D.CaptureId AND X.DemandText = D.DemandText
AND X.DemandId < D.DemandId
)
WHERE D.DemandText = 'Fund value'
ORDER BY CallDate DESC)
)
SELECT NbOfDemands, COUNT(*)
FROM myCTE
GROUP BY NbOfDemands
ORDER BY NbOfDemands
What this query provides:
The number of Captures which had exactly one demand
The number of Captures which had exactly two demands
..
The number of Captures which had exactly n demands
For the 100 MOST RECENT Captures which included a Demand of a particular value 'someValue' (and, this time, giving indeed 100, i.e. not counting the same CaptureID twice in case of dups on the Demand Type).
A few points:
You may want to use COUNT(DISTINCT DemandText) rather than COUNT(*) in the select list of the CTE. (The query does include 100 distinct CaptureIDs, so Capture #3 in your sample does not appear twice and hide another capture at the end of the list, but we still need to decide whether Capture #3 counts as a 3-demand or a 4-demand capture.)
Oops, not quite what you required, because each line shows the number of Capture instances that have exactly this number of demands...
Use a CASE on NbOfDemands to display the text as in the question (trivial).
This may show Capture instances with more than 4 demands, but that is probably a plus.
This would not show 0 if, for example, there were no Capture instances with a given number of demands.
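For comparison, the table the question asks for (how often each demand appears among the most recent calls that contain a chosen demand) can be prototyped in a few lines. This Python sketch runs on the sample data and assumes that a demand repeated on the same call should be counted once; `co_demand_counts` is a hypothetical name and is not part of the query above.

```python
from collections import Counter

# Sample data from the question.
headers = [  # (CaptureID, CallReference, CallDate)
    (1, 1, "2009-11-02 20:37:00"),
    (2, 3, "2009-11-02 20:37:05"),
    (3, 2, "2009-11-02 20:37:10"),
    (4, 4, "2009-11-02 20:38:00"),
    (5, 5, "2009-11-02 20:38:30"),
]
demands = [  # (DemandID, CaptureID, DemandText)
    (1, 1, "Fund value"), (2, 2, "Password reset"), (3, 2, "Fund value"),
    (4, 3, "Change address"), (5, 3, "Fund value"), (6, 3, "Rate change"),
    (7, 3, "Fund value"), (8, 4, "Variable to fixed"), (9, 4, "Change address"),
    (10, 5, "Fund value"), (11, 5, "Address change"),
]


def co_demand_counts(headers, demands, wanted, limit=100):
    """Count demand texts across the `limit` most recent captures containing `wanted`."""
    with_wanted = {cap for _, cap, text in demands if text == wanted}
    # ISO-style timestamps sort correctly as strings.
    recent = sorted(
        (h for h in headers if h[0] in with_wanted),
        key=lambda h: h[2],
        reverse=True,
    )[:limit]
    recent_ids = {h[0] for h in recent}
    # A set of (capture, text) pairs counts a repeated demand once per capture.
    per_capture = {(cap, text) for _, cap, text in demands if cap in recent_ids}
    return Counter(text for _, text in per_capture)


counts = co_demand_counts(headers, demands, "Fund value")
for text, n in counts.most_common():
    print(text, n)
```

If a repeated demand should count every time it occurs, replacing the set of pairs with a list would give that behaviour instead.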
It sounds like you are trying to solve a Many to Many problem with just two tables and you really need three tables. For example:
TABLE Calls
CallId | CallDate
----------------------------
1 | 2009-11-02 20:37:00
2 | 2009-11-02 20:37:05
3 | 2009-11-02 20:37:10
4 | 2009-11-02 20:38:00
5 | 2009-11-02 20:38:30
TABLE Requests
RequestId | RequestType
----------------------------
1 | Fund value
2 | Password reset
3 | Change address
4 | Rate change
5 | Variable to fixed
TABLE CallRequests (resolves the many to many)
CallId |RequestId
-----------------
1 |1
2 |2
2 |1
3 |3
3 |1
3 |4
3 |1
4 |5
4 |3
5 |1
5 |3
This data structure will let you query from the Call side of things and from the Request side of things.
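To illustrate querying from both sides, here is a minimal sketch that loads the three tables above into an in-memory SQLite database via Python's sqlite3 module. The table and column names are taken from the layout above; the two queries themselves are illustrative assumptions about what one might ask.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Calls (CallId INTEGER PRIMARY KEY, CallDate TEXT);
    CREATE TABLE Requests (RequestId INTEGER PRIMARY KEY, RequestType TEXT);
    CREATE TABLE CallRequests (CallId INTEGER, RequestId INTEGER);

    INSERT INTO Calls VALUES
        (1, '2009-11-02 20:37:00'), (2, '2009-11-02 20:37:05'),
        (3, '2009-11-02 20:37:10'), (4, '2009-11-02 20:38:00'),
        (5, '2009-11-02 20:38:30');
    INSERT INTO Requests VALUES
        (1, 'Fund value'), (2, 'Password reset'), (3, 'Change address'),
        (4, 'Rate change'), (5, 'Variable to fixed');
    INSERT INTO CallRequests VALUES
        (1, 1), (2, 2), (2, 1), (3, 3), (3, 1), (3, 4), (3, 1),
        (4, 5), (4, 3), (5, 1), (5, 3);
    """
)

# From the Request side: which calls included a 'Fund value' request?
calls_with_fund_value = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT c.CallId
        FROM Calls AS c
        JOIN CallRequests AS cr ON cr.CallId = c.CallId
        JOIN Requests AS r ON r.RequestId = cr.RequestId
        WHERE r.RequestType = 'Fund value'
        ORDER BY c.CallId
        """
    )
]

# From the Call side: which request types were made on call 3?
requests_on_call_3 = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT r.RequestType
        FROM CallRequests AS cr
        JOIN Requests AS r ON r.RequestId = cr.RequestId
        WHERE cr.CallId = 3
        ORDER BY r.RequestType
        """
    )
]

print(calls_with_fund_value)
print(requests_on_call_3)
```

DISTINCT is needed in both queries because call 3 carries the same request twice in the link table.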