Conditionally flag rows from one group based on data from another group - conditional-statements

General problem: I do not understand how to create a value based on a condition from other groups. I would like to do something like :
gen x = cond(cond1==1 & cond2==1, value[**of some other row in a different group**], other_value)
Specific problem: I have a massive data set with groups defined by an id that never changes and a secondary id (co_id) that does. Each group has multiple rows that repeat over time. Each row has a flag (is_a) that indicates a relationship to another group (i.e., another id) at a certain time. The relationship is indicated by a change in co_id so that it equals the co_id of the other group.
I am trying to do two things:
for the flagged rows (is_a == 1), find the id of the group that the new co_id belongs to, and
for that other group, flag the year that the connection was made.
In the sample data below, group 111 was connected to group 222 at time 11 (a connection is made only once). Based on the new co_id 'xzx' I want to indicate the connected id of 222 from that time onward. Note that other groups can have that co_id, but the correct one is the earliest appearance of that co_id in the data (so the correct one is 222 and not 777).
For group 222 then I flag that time when the connection was made (time == 11).
The Sample Data:
clear
input int id byte(is_a time) str3 co_id
111 0 10 "abc"
111 0 10 "abc"
111 1 11 "xzx"
111 1 11 "xzx"
111 1 12 "xzx"
111 1 12 "xzx"
222 0 10 "xzx"
222 0 10 "xzx"
222 0 11 "xzx"
222 0 11 "xzx"
222 0 12 "xzx"
222 0 12 "xzx"
777 1 13 "xzx"
end
Thank you in advance!

I think one of the problems was that I was mentally stuck on using id as the basis for my grouping operations. Using co_id helped here, with some pre-sorting:
sort co_id time id
by co_id: gen id_co = id[1] if is_a==1
Then create a helper variable to check whether the co_id changed since last time:
sort id time
by id: gen changed_co_id = cond(co_id[_n]!= co_id[_n-1], 1, 0)
by id: replace changed_co_id = 0 if _n==1
by id time: replace changed_co_id = 1 if changed_co_id[1]==1
Now I can create the flag for the other group to indicate when the connection was made:
#delimit ;
sort co_id time is_a changed_co_id;
by co_id time: gen is_conn = cond(is_a==0 &
changed_co_id==0 &
is_a[_N]==1 &
changed_co_id[_N]==1, 1, 0);
#delimit cr
So in order to create the flag we need to sort by co_id so we can get both ids of the connected groups, by time so they can coexist at the time of the connection (and onward), and by changed_co_id so we can find the exact time at which the connection was made. This arrangement also makes sure that the newly connected observations appear at the end of each group. Then we flag the observations of the group that initiated the connection: if they are not connected themselves and the last observation is, and newly so.
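For readers who do not use Stata, the same idea can be sketched in plain Python. This is only an illustration under stated assumptions: the function name link_groups and the tuple layout (id, is_a, time, co_id) are hypothetical, and the "owner" of a co_id is taken to be the id of its earliest appearance, as described in the question.

```python
def link_groups(rows):
    """rows: list of (id, is_a, time, co_id) tuples (hypothetical layout)."""
    # Owner of each co_id = id of its earliest appearance (ties broken by id).
    owner = {}
    for gid, _is_a, _time, co in sorted(rows, key=lambda r: (r[3], r[2], r[0])):
        owner.setdefault(co, gid)

    # Walk each id in time order and record (owner id, time) whenever a
    # flagged row switches to a co_id different from its previous one.
    connections = set()
    last_co = {}
    for gid, is_a, time, co in sorted(rows, key=lambda r: (r[0], r[2])):
        if gid in last_co and last_co[gid] != co and is_a == 1:
            connections.add((owner[co], time))
        last_co[gid] = co

    out = []
    for gid, is_a, time, co in rows:
        id_co = owner[co] if is_a == 1 else None          # connected id
        is_conn = int(is_a == 0 and (gid, time) in connections)
        out.append((gid, is_a, time, co, id_co, is_conn))
    return out


data = [
    (111, 0, 10, "abc"), (111, 0, 10, "abc"),
    (111, 1, 11, "xzx"), (111, 1, 11, "xzx"),
    (111, 1, 12, "xzx"), (111, 1, 12, "xzx"),
    (222, 0, 10, "xzx"), (222, 0, 10, "xzx"),
    (222, 0, 11, "xzx"), (222, 0, 11, "xzx"),
    (222, 0, 12, "xzx"), (222, 0, 12, "xzx"),
    (777, 1, 13, "xzx"),
]
result = link_groups(data)
```

With the sample data, the flagged rows of group 111 get id_co = 222, and the two rows of group 222 at time 11 get is_conn = 1.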

Related

Oracle SQL - Need to eliminate data if at least one of the particular condition is not satisfied

My question is related to Oracle SQL. I have two tables: a study table and a study part table. Stdyno is the primary key in the study table and (stdyno + sqncno) is the primary key in the studypart table.
EG: studypart table has data as below.
studyNo sqnc part approvalIN
--------------------------------
123 1 fgh Y
123 2 jhf N
123 3 rty N
456 1 wer N
456 2 wdg N
456 3 ghg N
I need a query whose output from the studypart table is the study numbers for which every approvalIn is N. If a study has at least one approvalIn of 'Y',
then that studyno should be excluded from the result.
Desired output:
studyno: 456
I tried implementing this in a stored procedure by taking the Y and N approvalIn counts separately, i.e.,
if a studyno has both counts then exclude it, and
if it has only one count, either N or Y, then include it.
But I would like to know how to achieve this in a query.
You can do it by excluding those rows whose count of "approvalIN = 'N'" does not match the total count of "approvalIN" values.
SELECT STUDYNO
FROM tab
GROUP BY STUDYNO
HAVING SUM(CASE WHEN approvalIN = 'N' THEN 1 END) = COUNT(approvalIN)
Check the demo here.
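The HAVING approach can be tried locally with a small sketch. This uses Python's built-in sqlite3 module rather than Oracle, so it only illustrates the technique; the table and column names (studypart, studyno, sqncno, approvalin) are assumptions based on the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE studypart ("
    " studyno INTEGER, sqncno INTEGER, part TEXT, approvalin TEXT,"
    " PRIMARY KEY (studyno, sqncno))"
)
conn.executemany(
    "INSERT INTO studypart VALUES (?, ?, ?, ?)",
    [
        (123, 1, "fgh", "Y"), (123, 2, "jhf", "N"), (123, 3, "rty", "N"),
        (456, 1, "wer", "N"), (456, 2, "wdg", "N"), (456, 3, "ghg", "N"),
    ],
)

# Keep a study only when the number of 'N' rows equals the number of rows.
query = """
    SELECT studyno
    FROM studypart
    GROUP BY studyno
    HAVING SUM(CASE WHEN approvalin = 'N' THEN 1 ELSE 0 END) = COUNT(*)
"""
all_n_studies = [row[0] for row in conn.execute(query)]
print(all_n_studies)  # [456]
```

Study 123 is dropped because only two of its three rows are 'N'.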

How to pull duplicates in transactional data based on date and other fields

I am looking at transactional data such as my credit card statement. I want to ensure that I am not getting my card swiped twice. The fields that I have are card number (I have multiple), amount of transaction, transaction date, merchant code, merchant name, and transaction code.
To know if it is a true duplicate transaction, I want to know if the merchant code, merchant name, and transaction amount appear more than once. I also want to make sure that the transactions were within 5 days of each other if all else matches.
I am doing the work in SAS code, but I can also do it in PROC SQL. So far in SAS I’ve sorted the data and then pulled a table that only holds duplicates, but since I’ve sorted the data, it will only call it a duplicate if the dates are exactly the same instead of applying the 5-day rule mentioned.
I did a simple PROC SORT.
PROC SORT DATA=WORK.TRANSACTIONS
OUT=WORK.TRANSACTIONS1
DUPOUT=WORK.SORTSORTEDDUPS
NODUPKEY;
BY CARD NUMBER TRANSACTION_AMOUNT TRANSACTION_DATE MERCHANT_CODE MERCHANT_NAME TRANSACTION_CODE;
RUN;
What do I need to incorporate to add my rule of transaction within 5 days?
You can do it with an additional pass, retaining (and comparing to) the last transaction date as per the below. Note the change in the sort BY statement (you'll need to update the proc sort also).
data duplicates;
set work.transactions1;
by CARD NUMBER TRANSACTION_AMOUNT MERCHANT_CODE MERCHANT_NAME TRANSACTION_CODE TRANSACTION_DATE;
retain datecheck 0;
if first.TRANSACTION_CODE then datecheck=0;
else if TRANSACTION_DATE-datecheck le 5 then output;
datecheck=TRANSACTION_DATE;
run;
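The "sort, then compare each row to the previous one in its group" pattern can also be sketched outside SAS. The following Python version is a minimal illustration, not a translation of the data step: the function name near_duplicates and the dictionary keys (card, amount, merchant_code, date) are hypothetical, and the grouping key is simplified to card, amount, and merchant code.

```python
from datetime import date
from itertools import groupby


def near_duplicates(transactions, window_days=5):
    """Return transactions that follow a matching one within window_days."""
    def group_key(t):
        return (t["card"], t["amount"], t["merchant_code"])

    # Sort by the matching fields first and by date last, so that candidate
    # duplicates end up next to each other in date order.
    ordered = sorted(transactions, key=lambda t: (group_key(t), t["date"]))

    flagged = []
    for _, group in groupby(ordered, key=group_key):
        previous = None  # plays the role of the retained datecheck value
        for t in group:
            if previous is not None and (t["date"] - previous["date"]).days <= window_days:
                flagged.append(t)
            previous = t
    return flagged


sample = [
    {"card": 1, "amount": 100, "merchant_code": 8, "date": date(1990, 1, 1)},
    {"card": 1, "amount": 100, "merchant_code": 8, "date": date(1990, 1, 4)},
    {"card": 1, "amount": 100, "merchant_code": 8, "date": date(1990, 1, 20)},
    {"card": 2, "amount": 100, "merchant_code": 8, "date": date(1990, 1, 2)},
]
dupes = near_duplicates(sample)
```

Only the 4 January transaction is flagged here: it follows a matching transaction by 3 days, while the 20 January one is 16 days after its predecessor.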
Let's create our practice data source:
DATA MY_CREDIT_CARDS;
INPUT
C_NUMBER
TRANC_AMOUNT
TRANSC_DATE :DATE10.
TRANSC_CODE
MERCH_CODE
MERCH_NAME $10.;
FORMAT TRANSC_DATE DDMMYY10.;
CARDS;
1 100 17JAN1990 1 1 AMAZON
2 200 01JAN1990 2 8 WALLMART
4 100 04JAN1990 3 5 CRUSTYKRAB
2 200 07JAN1990 4 7 NETFLIX
1 300 01JAN1990 5 2 GOOGLEPLAY
3 200 17JAN1990 6 8 WALLMART
5 100 18JAN1990 7 2 GOOG.PLAY
5 300 19JAN1990 8 2 GOOGLEPLAY
2 200 22JAN1990 9 8 WALLMART
4 200 20JAN1990 10 2 GOOGLEPLAY
1 100 03JAN1990 11 2 GOOG.PLAY
1 100 17JAN1990 12 1 AMZN
;
RUN;
Now, first of all, I recommend not using descriptive fields such as names (merchant name in this case) as keys, because descriptive fields can be very variable, i.e. someone can register AMAZON as AMZN or AMAZN, or any combination you could imagine, as the merchant name. Use ID fields instead. So, assuming merchant code is a unique ID, I think that is enough to identify the merchant.
Considering the above, using PROC SQL you could do something like this to find duplicates based on the rule you provide (and without the need of using any other extra-step):
PROC SQL;
/*The following assumes each record is unique
(identified by 'transaction code' in this case),
otherwise you must handle duplicate records properly.*/
SELECT
DISTINCT A.*,
CASE WHEN
B.TRANSC_CODE IS NOT NULL
THEN 1 ELSE 0 END AS DUPLICATED
FROM MY_CREDIT_CARDS AS A
LEFT JOIN MY_CREDIT_CARDS AS B
ON
A.MERCH_CODE = B.MERCH_CODE AND
A.TRANC_AMOUNT = B.TRANC_AMOUNT AND
A.TRANSC_CODE ^= B.TRANSC_CODE AND
A.TRANSC_DATE >= INTNX('day',B.TRANSC_DATE,-5) AND
A.TRANSC_DATE <= INTNX('day',B.TRANSC_DATE,5)
;
/*You could use an ORDER BY clause to sort the
results as you want.*/
QUIT;
The result has a new column named "DUPLICATED" that shows 1 if the record was found to be a duplicate and 0 if not.
Hope it helps.
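The self-join rule can be checked quickly with Python's built-in sqlite3 module. This is a sketch under assumptions, not SAS: the table name tx is hypothetical, dates are stored as ISO strings, the 5-day window is computed with SQLite's julianday(), and an EXISTS subquery stands in for the LEFT JOIN plus DISTINCT. The rows are a subset of the practice data above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tx (c_number INTEGER, amount INTEGER, transc_date TEXT,"
    " transc_code INTEGER PRIMARY KEY, merch_code INTEGER)"
)
conn.executemany(
    "INSERT INTO tx VALUES (?, ?, ?, ?, ?)",
    [
        (1, 100, "1990-01-17", 1, 1),
        (2, 200, "1990-01-01", 2, 8),
        (3, 200, "1990-01-17", 6, 8),
        (5, 100, "1990-01-18", 7, 2),
        (2, 200, "1990-01-22", 9, 8),
        (1, 100, "1990-01-03", 11, 2),
        (1, 100, "1990-01-17", 12, 1),
    ],
)

# A row is a duplicate when another row has the same merchant code and
# amount, a different transaction code, and a date at most 5 days away.
query = """
    SELECT a.transc_code,
           EXISTS (
               SELECT 1 FROM tx AS b
               WHERE b.merch_code = a.merch_code
                 AND b.amount = a.amount
                 AND b.transc_code <> a.transc_code
                 AND ABS(julianday(b.transc_date) - julianday(a.transc_date)) <= 5
           ) AS duplicated
    FROM tx AS a
    ORDER BY a.transc_code
"""
flags = dict(conn.execute(query).fetchall())
duplicated_codes = sorted(code for code, dup in flags.items() if dup)
print(duplicated_codes)  # [1, 6, 9, 12]
```

Transactions 1 and 12 match on the same day, and 6 and 9 are exactly 5 days apart; the others have no partner inside the window.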

access SQL count results using multiple sub queries against one table

I am using Access with a table having over 200k rows of data. I am looking for counts on a column which is broken down by job descriptions. For example, I want to return the total count (id) for a location where a person is status = "active" and position like "cook" [should equal 20] also another output where I get a count (id) for the same location where a person is status = "active" and position = "Lead Cook" [should equal 5]. So, one is a partial of the total population.
I have a few others to do just like this (# Bakers, # Lead Bakers...). How can I do this with one grand query/subquery, or with one query for each grouping?
My attempt is more like this:
SELECT
a.location,
Count(a.EMPLOYEE_NUMBER) AS [# Cook Total], --- should equal 20
(SELECT count(b.EMPLOYEE_ID) FROM Table_abc AS b where b.STATUS="Active Assignment" AND b.POSITION Like "*cook*" AND b.EMPLOYEE_ID=a.EMPLOYEE_ID) AS [# Lead Cook], --- should equal 5
FROM Table_abc AS a
ORDER BY a.location;
Results should be similar to:
Location Total Cooks Lead Cooks Total Bakers Lead Bakers
1 20 4 15 2
2 45 7 12 2
3 22 2 16 1
4 19 2 17 2
5 5 1 9 1
Try using conditional aggregation -- no need for sub queries.
Something like this should work (although I may not understand your desired results completely):
select location,
count(EMPLOYEE_NUMBER) as CookTotal,
sum(IIf(POSITION Like "*cook*",1,0)) as AllCooks,
sum(IIf(POSITION = "Lead Cook",1,0)) as LeadCooks
from Table_abc
where STATUS="Active Assignment"
group by location
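Conditional aggregation is easy to try out with Python's built-in sqlite3 module. Note that this is SQLite, not Access: the wildcard is % instead of *, and CASE WHEN replaces IIf. The table name staff and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (location INTEGER, position TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [
        (1, "Cook", "Active Assignment"),
        (1, "Lead Cook", "Active Assignment"),
        (1, "Baker", "Active Assignment"),
        (1, "Cook", "Terminated"),
        (2, "Lead Cook", "Active Assignment"),
        (2, "Lead Baker", "Active Assignment"),
    ],
)

# One pass over the table: each SUM counts only the rows matching its CASE.
query = """
    SELECT location,
           SUM(CASE WHEN position LIKE '%cook%' THEN 1 ELSE 0 END) AS all_cooks,
           SUM(CASE WHEN position = 'Lead Cook' THEN 1 ELSE 0 END) AS lead_cooks
    FROM staff
    WHERE status = 'Active Assignment'
    GROUP BY location
    ORDER BY location
"""
counts = conn.execute(query).fetchall()
print(counts)  # [(1, 2, 1), (2, 1, 1)]
```

Further columns (bakers, lead bakers, and so on) follow the same pattern: one SUM(CASE ...) per column.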

How to label a big set of “transitive groups” with a constraint?

EDIT after #NealB's solution: #NealB's solution is very fast compared with any other, and makes this new question about "adding a constraint to improve performance" unnecessary. #NealB's solution needs no improvement, runs in O(n) time, and is very simple.
The problem of "labeling transitive groups with SQL" has an elegant solution using recursion and a CTE... But this solution consumes exponential time (!). I need to work with 10000 items: 1000 items need 1 second, 2000 need 1 day...
Constraint: in my case it is possible to break the problem into pieces of ~100 items or fewer, but only to select one group of ~10 items and discard all the other ~90 labeled items...
Is there a generic algorithm to add and use this kind of "pre-selection" to reduce the quadratic, O(N^2), time? Perhaps, as shown by the comments and #wildplasser, O(N log(N)) time is possible; but I expect "pre-selection" to reduce it to O(N) time.
(EDIT)
I tried an alternative algorithm, but it needs some improvement to be used as a solution here; or, to really improve performance (to O(N) time), it needs to use "pre-selection".
The "pre-selection" (constraint) is based on a "super-set grouping"... Starting from the t1 table of the original "How to label 'transitive groups' with SQL?" question,
table T1
(original T1 augmented by the "super-set grouping label" ssg, and one more row)
ID1 | ID2 | ssg
1 | 2 | 1
1 | 5 | 1
4 | 7 | 1
7 | 8 | 1
9 | 1 | 1
10 | 11 | 2
So there are three groups,
g1: {1,2,5,9} because "1 t 2", "1 t 5" and "9 t 1"
g2: {4,7,8} because "4 t 7" and "7 t 8"
g3: {10,11} because "10 t 11"
The super-group is only an auxiliary grouping,
ssg1: {g1,g2}
ssg2: {g3}
If we have M super-group items and N total T1 items, the average group length will be less than N/M. We can also suppose (for my typical problem) that the maximum ssg length is ~N/M.
So the "label algorithm" needs to run only M times with ~N/M items each if it uses the ssg constraint.
An SQL-only solution appears to be a bit of a problem here. With the help of some procedural programming on top of SQL, the solution appears to be fairly simple and efficient. Here is a brief outline of a solution that could be implemented in any procedural language invoking SQL.
Declare table R with primary key ID, where ID corresponds to the same domain as ID1 and ID2 of table T1. Table R contains one other non-key column, a Label number.
Populate table R with the range of values found in T1. Set Label to zero (no label).
Using your example data, the initial setup for R would look like:
Table R
ID Label
== =====
1 0
2 0
4 0
5 0
7 0
8 0
9 0
Using a host-language cursor plus an auxiliary counter, read each row from T1. Look up ID1 and ID2 in R. You will find one of four cases:
Case 1: ID1.Label == 0 and ID2.Label == 0
In this case neither of these IDs has been "seen" before. Add 1 to the counter and then update both rows of R to the value of the counter: update R set R.Label = :counter where R.ID in (:ID1, :ID2)
Case 2: ID1.Label == 0 and ID2.Label <> 0
In this case, ID1 is new but ID2 has already been assigned a label. ID1 needs to be assigned the same label as ID2: update R set R.Label = :ID2.Label where R.ID = :ID1
Case 3: ID1.Label <> 0 and ID2.Label == 0
In this case, ID2 is new but ID1 has already been assigned a label. ID2 needs to be assigned the same label as ID1: update R set R.Label = :ID1.Label where R.ID = :ID2
Case 4: ID1.Label <> 0 and ID2.Label <> 0
In this case, the row contains redundant information. Both rows of R should contain the same Label value. If not, there is some sort of data integrity problem. Ahhhh... not quite, see edit...
EDIT: I just realized that there are situations where both Label values here could be non-zero and different. If both are non-zero and different, then the two Label groups need to be merged at this point. All you need to do is choose one Label and update the others to match, with something like: update R set R.Label = :ID1.Label where R.Label = :ID2.Label. Now both groups have been merged under the same Label value.
Upon completion of the cursor, table R will contain Label values needed to update T2.
Table R
ID Label
== =====
1 1
2 1
4 2
5 1
7 2
8 2
9 1
Process table T2 using something along the lines of: set T2.Label to R.Label where T2.ID1 = R.ID. The end result should be:
table T2
ID1 | ID2 | LABEL
1 | 2 | 1
1 | 5 | 1
4 | 7 | 2
7 | 8 | 2
9 | 1 | 1
This process is purely iterative and should scale to fairly large tables without difficulty.
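The four cases above can be sketched in a few lines of Python, with a dictionary standing in for table R. This is only an in-memory illustration of the outlined procedure, not the cursor-plus-SQL implementation; the function name label_pairs is hypothetical.

```python
def label_pairs(pairs):
    """Assign a group label to every id that appears in pairs of (ID1, ID2)."""
    label = {}    # stands in for table R: id -> Label (missing means 0)
    counter = 0
    for id1, id2 in pairs:
        l1, l2 = label.get(id1, 0), label.get(id2, 0)
        if l1 == 0 and l2 == 0:
            # Case 1: neither id has been seen, so start a new label.
            counter += 1
            label[id1] = label[id2] = counter
        elif l1 == 0:
            # Case 2: ID1 is new and joins the group of ID2.
            label[id1] = l2
        elif l2 == 0:
            # Case 3: ID2 is new and joins the group of ID1.
            label[id2] = l1
        elif l1 != l2:
            # Case 4 (see EDIT): two existing groups meet, so merge them.
            for other in [k for k, v in label.items() if v == l2]:
                label[other] = l1
    return label


t1 = [(1, 2), (1, 5), (4, 7), (7, 8), (9, 1)]
labels = label_pairs(t1)
print(labels)  # {1: 1, 2: 1, 5: 1, 4: 2, 7: 2, 8: 2, 9: 1}
```

The merge in case 4 scans every labeled id, so an input that triggers many merges costs more than a single pass over the pairs.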
I suggest you check this and use some
general-purpose language for solving it.
http://en.wikipedia.org/wiki/Disjoint-set_data_structure
Traverse the graph, maybe run DFS or BFS from each node,
then use this disjoint set hint. I think this should work.
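As a rough illustration of the disjoint-set idea, here is a small union-find in Python with path compression and union by size. The class and function names (DisjointSet, transitive_groups) are hypothetical and chosen only for this sketch.

```python
class DisjointSet:
    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen items as their own singleton set.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union by size: attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def transitive_groups(pairs):
    ds = DisjointSet()
    for a, b in pairs:
        ds.union(a, b)
    groups = {}
    for item in list(ds.parent):
        groups.setdefault(ds.find(item), set()).add(item)
    return sorted(sorted(g) for g in groups.values())


t1 = [(1, 2), (1, 5), (4, 7), (7, 8), (9, 1), (10, 11)]
print(transitive_groups(t1))  # [[1, 2, 5, 9], [4, 7, 8], [10, 11]]
```

With this structure no separate DFS or BFS is needed: one union per pair is enough to build the groups.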
The #NealB solution is the fastest(!) See an example of a PostgreSQL implementation here.
Below is an example of another "brute force algorithm", only for curiosity!
As #peter.petrov and #RBarryYoung suggested, some performance problems can be avoided by abandoning the CTE recursion... I fixed some issues in the basic labeler and, on top of that, added the constraint for grouping by a super-set label. This new transgroup1_loop() function is working!
PS: this solution still has performance limitations; please post your answer if you have a better one or an adaptation of this one.
-- DROP table transgroup1;
CREATE TABLE transgroup1 (
id serial NOT NULL PRIMARY KEY,
items integer[], -- two or more items in the transitive relationship
ssg_label varchar(12), -- the super-set grouping label
dels integer[] DEFAULT array[]::integer[]
);
INSERT INTO transgroup1(items,ssg_label) values
(array[1, 2],'1'),
(array[1, 5],'1'),
(array[4, 7],'1'),
(array[7, 8],'1'),
(array[9, 1],'1'),
(array[10, 11],'2');
-- or SELECT array[id1, id2],ssg_label FROM t1, with 10000 items
Then, with these two functions we can solve the problem:
CREATE FUNCTION transgroup1_loop(p_ssg varchar, p_max_i integer DEFAULT 100)
RETURNS integer AS $funcBody$
DECLARE
cp_dels integer[];
i integer;
BEGIN
i:=1;
LOOP
UPDATE transgroup1
SET items = array_uunion(transgroup1.items,t2.items),
dels = transgroup1.dels || t2.id
FROM transgroup1 AS t1, transgroup1 AS t2
WHERE transgroup1.id=t1.id AND t1.ssg_label=$1 AND
t1.id>t2.id AND t1.items && t2.items;
cp_dels := array(
SELECT DISTINCT unnest(dels) FROM transgroup1
); -- collects all the ids to delete
RAISE NOTICE '-- bug, repeting dels, item-%; % dels! %', i, array_length(cp_dels,1), array_to_string(cp_dels,';','*');
-- array_length() returns NULL (not 0) for an empty array, hence the COALESCE
EXIT WHEN i>p_max_i OR COALESCE(array_length(cp_dels,1),0)=0;
DELETE FROM transgroup1
WHERE ssg_label=$1 AND id IN (SELECT unnest(cp_dels));
UPDATE transgroup1 SET dels=array[]::integer[];
i:=i+1;
END LOOP;
UPDATE transgroup1 -- only to beautify
SET items = ARRAY(SELECT unnest(items) ORDER BY 1 desc);
RETURN i;
END;
$funcBody$ LANGUAGE plpgsql VOLATILE;
To run it and see the results, you can use
SELECT transgroup1_loop('1'); -- run with ssg-1 items only
SELECT transgroup1_loop('2'); -- run with ssg-2 items only
-- show all with a sequential group label:
SELECT *, dense_rank() over (ORDER BY id) AS group_label from transgroup1;
results:
id | items | ssg_label | dels | group_label
----+-----------+-----------+------+-------------
4 | {8,7,4} | 1 | {} | 1
5 | {9,5,2,1} | 1 | {} | 2
6 | {11,10} | 2 | {} | 3
PS: the function array_uunion() is the same as in the original question,
CREATE FUNCTION array_uunion(anyarray,anyarray) RETURNS anyarray AS $$
-- ensures distinct items of a concatenation
SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
$$ LANGUAGE sql immutable;

MS Access Update with Increment of Prior Record

I have an MS Access 2007 database that I need to create an update for. The table I am trying to update looks like this:
CarID WeekOf NumDataPoints NumWksZeroPoints
3AA May-14-2011 23 0
7BB May-14-2011 9 0
3AA May-21-2011 35 0
7BB May-21-2011 0 1
3AA May-28-2011 24
7BB May-28-2011 0
I am processing the latest recordset of May-28-2011, and the gist is to update each car with the number of weeks it has had no data points. I do this by checking the current week's number of points: if it does have some points then NumWksZeroPoints gets set to zero, and if the current number of points is zero then I take the prior week's count and increment it by one. For my last week I would have input
0
2
So I have tried something like
UPDATE tblCars
SET NumWksZeroPoints = IIF(NumDataPoints<>0, 0, (SELECT MAX(NumWksZeroPoints) AS wzp
FROM tblCars AS f
WHERE f.CarID=tblCars.CarID AND
f.WeekEnding=#5/21/2011#) + 1
)
WHERE WeekOf=#5/28/2011#;
Unfortunately this doesn't work like I thought it would. I think I have the concept down and most of the SQL, I just can't seem to make it work. This is against MS Access, so some of the other tricks I know just don't work. Any help appreciated.
You could (and some might say should) do this as a query, without updating the table. If you are capturing the datapoints per week per car, your query can compute the number of weeks a car has had no data points using date math. What happens if someone inserts data for a car after you have run your update? You end up with data that are inconsistent.
Using your sample data I ran the following
UPDATE tblcar AS c
INNER JOIN tblcar AS previous
ON c.carid = previous.carid
SET c.numwkszeropoints = Iif([previous].[NumWksZeroPoints] = 0, 0,
[previous].[NumWksZeroPoints] + 1)
WHERE c.weekof =#5/28/2011#
AND previous.weekof =#5/21/2011#;
The table afterwards looked like this
CarID WeekOf NumDataPoints NumWksZeroPoints
----- ---------- ------------- -----------------
3AA 05/14/2011 23 0
7BB 05/14/2011 9 0
3AA 05/21/2011 35 0
7BB 05/21/2011 0 1
3AA 05/28/2011 24 0
7BB 05/28/2011 0 2
Basically, the query does a self join back to the previous week and then updates the current week to the previous week's value + 1 if it's not zero.
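For comparison, the rule as stated in the question (reset to zero when the current week has data points, otherwise take the prior week's count plus one) can be sketched in Python. Note that this keys on the current week's NumDataPoints, which differs slightly from the Iif condition in the query above. The function name weeks_without_points and the tuple layout are hypothetical.

```python
from datetime import date


def weeks_without_points(history):
    """history: iterable of (car_id, week_of, num_data_points) tuples."""
    streak = {}   # car_id -> current run of weeks with zero data points
    result = {}
    for car, week, points in sorted(history, key=lambda r: (r[0], r[1])):
        if points != 0:
            streak[car] = 0                       # data present: reset
        else:
            streak[car] = streak.get(car, 0) + 1  # no data: extend the run
        result[(car, week)] = streak[car]
    return result


history = [
    ("3AA", date(2011, 5, 14), 23), ("7BB", date(2011, 5, 14), 9),
    ("3AA", date(2011, 5, 21), 35), ("7BB", date(2011, 5, 21), 0),
    ("3AA", date(2011, 5, 28), 24), ("7BB", date(2011, 5, 28), 0),
]
zero_weeks = weeks_without_points(history)
print(zero_weeks[("7BB", date(2011, 5, 28))])  # 2
```

Because the counts are derived from the weekly data on demand, this also fits the earlier suggestion to compute the value in a query instead of storing it.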