SPSS: search for repeated occurrences at previous time points - repeat

I have a dataset as such:
Case # | DateA  | Drug.1 | Drug.2 | Drug.3 | DateB.1  | DateB.2  | DateB.3  | RcvdDrugXTimeA
-------|--------|--------|--------|--------|----------|----------|----------|---------------
1      | DateA1 | X      | Y      | X      | DateB1.1 | DateB1.2 | DateB1.3 | 0
2      | DateA2 | X      | Y      | X      | DateB2.1 | DateB2.2 | DateB2.3 | 1
3      | DateA3 | Y      | Z      | X      | DateB3.1 | DateB3.2 | DateB3.3 | 1
4      | DateA4 | Z      | Z      | Z      | DateB4.1 | DateB4.2 | DateB4.3 | 0
Each case has many records of receiving drugs at different dates. My endpoint is receiving Drug X at Time A, where Time is calculated based on the difference between DateA and DateB for each receipt of a drug.
If a case receives Drug X at Time A, then I want to know if they also received Drug X at a previous date (to determine whether receiving Drug X at Time A is a new occurrence or a repeat occurrence).
For each case that received Drug X at least once during Time A, I have variable RcvdDrugXTimeA = 1.
If they did receive Drug X before Time A, then I want RcvdPriorDrugX = 2. If they did not, then I want RcvdPriorDrugX = 1. If they never received Drug X, then ideally RcvdPriorDrugX = 0.
In the example above, Case 1 received Drug X but not within Time A, whereas Cases 2 and 3 both received Drug X during Time A, thus RcvdDrugXTimeA = 1 for these cases. Say that this occurred for both of them at DateB.3. The difference between these cases is that Case 2 also received Drug X previously, thus making the receipt of Drug X at Time A a repeat occurrence. On the other hand, Case 3 did not receive Drug X previously, thus making the receipt of Drug X at Time A a new occurrence for this case. I would like to create a function that looks back across the records for those cases where the criterion of receiving Drug X during Time A is met, and to produce an output to tell me if this has happened previously or if it's a new occurrence. I would like to exclude those cases where they never received Drug X at Time A in the output variable.

Not completely sure I understand what you're trying to do, but the following syntax covers the example you gave and produces the output you described. If the example (and my syntax) don't cover other possibilities or considerations you need to include in the process, please let me know.
First, I recreate your example dataset:
data list list/Case#(f1) DateA Drug.1 Drug.2 Drug.3 DateB.1 DateB.2 DateB.3 (7a10) RcvdDrugXTimeA(f1).
begin data
1 "DateA1" "X" "Y" "X" "DateB1.1" "DateB1.2" "DateB1.3" 0
2 "DateA2" "X" "Y" "X" "DateB2.1" "DateB2.2" "DateB2.3" 1
3 "DateA3" "Y" "Z" "X" "DateB3.1" "DateB3.2" "DateB3.3" 1
4 "DateA4" "Z" "Z" "Z" "DateB4.1" "DateB4.2" "DateB4.3" 0
end data.
Now I count the number of times Drug X was administered. Then, if RcvdDrugXTimeA is 0, the output is 0; if it is 1, the output is 1 when X was administered only once (a new occurrence) or 2 when it was administered more than once (a repeat occurrence):
count NofX=Drug.1 Drug.2 Drug.3("X").
compute RcvdPriorDrugX = RcvdDrugXTimeA * (1 + (NofX > 1)).


How can I match a sequence of numbers with specific columns based on their names in BigQuery, with the SQL query result grouped by this sequence?

I am working with two tables in BigQuery and I need to achieve the following result:
ret_d | ad_arpu | pur_arpu
------|---------|---------
d1    | x       | x
d2    | y       | y
dx    | etc     | etc
A bit of clarification: ARPU is average revenue per user, which in this case consists of revenue from purchases (pur_arpu) and from ads (ad_arpu).
I need to return pur_arpu and ad_arpu next to the corresponding day that the customer has been retained, i.e. retention day #x (ret_dx). ARPU is revenue divided by the number of customers in a certain period of time. I need ARPU per retained day.
In the table I am working with, I have the following relevant columns:
C_id | ret_d1 | ret_d2 | ret_dx | ad_rev_d1 | ad_rev_d2 | ad_rev_dx | pur_rev_d1 | pur_rev_d2 | pur_rev_dx
-----|--------|--------|--------|-----------|-----------|-----------|------------|------------|-----------
id1  | 0 OR 1 | 0 OR 1 | 0 OR 1 | x         | x         | x         | x          | x          | x
id2  | 0 OR 1 | 0 OR 1 | 0 OR 1 | y         | y         | y         | y          | y          | y
idx  | 0 OR 1 | 0 OR 1 | 0 OR 1 | etc       | etc       | etc       | etc        | etc        | etc
c_id - customer id,
ret_dx - retained day #x, 1 stands for user retaining on that day,
ad_rev_dx - revenue from ads on dx,
pur_rev_dx - revenue from purchases on dx.
My idea of the SQL query was to return a sequence of numbers, ad_arpu, and pur_arpu from a subquery. So something along these lines:
SELECT
  ["d1", "d2", "dx"] AS ret_d,
  ad_arpu AS ad_arpu,
  pur_arpu AS pur_arpu
FROM
  (SELECT
     SUM(ad_rev_d1)/SUM(ret_d1) AS ad_arpu_d1,
     SUM(ad_rev_d2)/SUM(ret_d2) AS ad_arpu_d2,
     SUM(ad_rev_dx)/SUM(ret_dx) AS ad_arpu_dx,
     SUM(pur_rev_d1)/SUM(ret_d1) AS pur_arpu_d1,
     SUM(pur_rev_d2)/SUM(ret_d2) AS pur_arpu_d2,
     SUM(pur_rev_dx)/SUM(ret_dx) AS pur_arpu_dx
     --somehow match ad_arpu_dx and pur_arpu_dx to the corresponding days from the ret_d sequence to get ad_arpu and pur_arpu
   FROM tablex)
But so far my efforts have been getting me nowhere :( I'm missing how to match the arpu numbers to the day sequence numbers.
I really appreciate any tips that could help with this!
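One possible way to do the matching (offered only as a sketch, not a definitive answer): compute the aggregates once in a subquery, then unpivot them with an array of STRUCTs and UNNEST so each pair of arpu values sits next to its day label. This assumes the source table really is called tablex and has the ret_dx / ad_rev_dx / pur_rev_dx columns listed above:

-- aggregate once, then unpivot the per-day columns into rows (BigQuery Standard SQL)
SELECT day.ret_d, day.ad_arpu, day.pur_arpu
FROM (
  SELECT
    SUM(ad_rev_d1)  / SUM(ret_d1) AS ad_arpu_d1,   -- SAFE_DIVIDE(...) avoids errors if a day has 0 retained users
    SUM(ad_rev_d2)  / SUM(ret_d2) AS ad_arpu_d2,
    SUM(ad_rev_dx)  / SUM(ret_dx) AS ad_arpu_dx,
    SUM(pur_rev_d1) / SUM(ret_d1) AS pur_arpu_d1,
    SUM(pur_rev_d2) / SUM(ret_d2) AS pur_arpu_d2,
    SUM(pur_rev_dx) / SUM(ret_dx) AS pur_arpu_dx
  FROM tablex
) t,
UNNEST([
  STRUCT('d1' AS ret_d, t.ad_arpu_d1 AS ad_arpu, t.pur_arpu_d1 AS pur_arpu),
  STRUCT('d2' AS ret_d, t.ad_arpu_d2 AS ad_arpu, t.pur_arpu_d2 AS pur_arpu),
  STRUCT('dx' AS ret_d, t.ad_arpu_dx AS ad_arpu, t.pur_arpu_dx AS pur_arpu)
]) AS day

Each STRUCT pairs one day label with its two aggregates, so covering more days is just a matter of adding more STRUCT lines.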

The King's March

You’re given a chess board with dimension n x n. There’s a king at the bottom right square of the board marked with s. The king needs to reach the top left square marked with e. The rest of the squares are labeled either with an integer p (marking a point) or with x marking an obstacle. Note that the king can move up, left and up-left (diagonal) only. Find the maximum points the king can collect and the number of such paths the king can take in order to do so.
Input Format
The first line of input consists of an integer t. This is the number of test cases. Each test case contains a number n, which denotes the size of the board. This is followed by n lines, each containing n space-separated tokens.
Output Format
For each case, print in a separate line the maximum points that can be collected and the number of paths that achieve that maximum, both values separated by a space. If e is unreachable from s, print 0 0.
Sample Input
3
3
e 2 3
2 x 2
1 2 s
3
e 1 2
1 x 1
2 1 s
3
e 1 1
x x x
1 1 s
Sample Output
7 1
4 2
0 0
Constraints
1 <= t <= 100
2 <= n <= 200
1 <= p <= 9
I think this problem could be solved using dynamic programming. We could use dp[i,j] to store the best number of points you can obtain by going from the bottom right corner to position (i, j). We can calculate dp[i,j], for a valid (i, j), based on dp[i+1,j], dp[i,j+1] and dp[i+1,j+1], if these are valid positions (not outside the matrix and not marked as x), adding the points obtained in cell (i, j). You should compute from the bottom right corner to the top left, row by row, beginning from the last column.
For the number of paths, you can add a second matrix, ways, and use it to store how many paths achieve the maximum at each cell.
This is example pseudocode to show the idea for a single cell (i, j):
dp[i,j] = dp[i+1,j+1] + board[i,j]
ways[i,j] = ways[i+1,j+1]
if dp[i,j] < dp[i+1,j] + board[i,j]:
    dp[i,j] = dp[i+1,j] + board[i,j]
    ways[i,j] = ways[i+1,j]
elif dp[i,j] == dp[i+1,j] + board[i,j]:
    ways[i,j] += ways[i+1,j]
# repeat the same comparison for the cell to the right, (i, j+1)
if dp[i,j] < dp[i,j+1] + board[i,j]:
    dp[i,j] = dp[i,j+1] + board[i,j]
    ways[i,j] = ways[i,j+1]
elif dp[i,j] == dp[i,j+1] + board[i,j]:
    ways[i,j] += ways[i,j+1]
This assumes all three neighbouring positions are valid (inside the board and not marked x); in the real code you would skip the invalid ones.
The final result is stored in dp[0,0] and ways[0,0].
Brief overview:
This problem can be solved through a recursive method call, starting from (n, n) until it reaches (0, 0), which is the king's destination.
For a detailed explanation and the solution to this problem, check it out here -> https://www.callstacker.com/detail/algorithm-1

Working of Merge in SAS (with IN=)

I have two datasets, data1 and data2:
data data1;
input sn id $;
datalines;
1 a
2 a
3 a
;
run;
data data2;
input id $ sales x $;
datalines;
a 10 x
a 20 y
a 30 z
a 40 q
;
run;
I am merging them with the code below:
data join;
merge data1(in=a) data2(in=b);
by id;
if a and b;
run;
Result (I was expecting an inner join result, which is not the case):
1 a 10 x
2 a 20 y
3 a 30 z
3 a 40 q
Result from PROC SQL inner join:
proc sql;
select data1.id, sn, sales, x from data2 inner join data1 on data1.id = data2.id;
quit;
Result (as expected from an inner join):
a 1 10 x
a 1 20 y
a 1 30 z
a 1 40 q
a 2 10 x
a 2 20 y
a 2 30 z
a 2 40 q
a 3 10 x
a 3 20 y
a 3 30 z
a 3 40 q
I want to know the concept and the step-by-step working of the MERGE statement in SAS with IN=, explaining the above results.
PS: I have read this, and it says
An obvious use for these variables is to control what kind of 'merge' will occur, using if statements. For example, if ThisRecordIsFromYourData and ThisRecordIsFromOtherData; will make SAS only include rows that match on the by variables from both input data sets (like an inner join).
which, I guess ("like an inner join"), is not always the case.
Basically, this is a result of the difference in how the SAS data step and SQL process their respective join/merges.
SQL creates a separate record for each possible combination of keys. This is a Cartesian Product (at the key level).
The SAS data step, however, processes merges very differently. MERGE is really nothing more than a special case of SET. It still processes rows iteratively, one at a time - it never goes back, and never has more than one row from any dataset in the PDV at once. Thus, it cannot create a Cartesian product in its normal process - that would require random access, which the SAS data step doesn't do normally.
What it does, for each unique BY value:
1. Take the next record from the left-side dataset, if one exists with that BY value.
2. Take the next record from the right-side dataset, if one exists with that BY value.
3. Output a row.
4. Continue until both datasets are exhausted for that BY value.
With BY values that yield unique records per value on either side (or both), it is effectively identical to SQL. However, with BY values that yield duplicates on BOTH sides, you get what you have there: a side-by-side merge, and if one runs out before the other, the values from the last row of the shorter dataset (for that by value) are more-or-less copied down. (They're actually RETAINED, so if you overwrite them with changes, they will not reset on new records from the longer dataset).
So, if left has 3 records and right has 4 records for key value a, like in your example, then you get data from the following records (assuming you don't alter the data after):
left right
1 1
2 2
3 3
3 4

How do I appropriately use a wildcard to select columns and build a new field in Access 2010?

This post is related in several aspects to the following:
Selecting all columns that start with XXX using a wildcard?
I am currently using Access 2010. I would like to add new columns to my table, based off values of the other columns.
Current table (Table #1):
Row | PlaceID | FoodItem1_10 | FoodItem1_02 | FoodItem2_10 | FoodItem2_02
----|---------|--------------|--------------|--------------|-------------
001 | Park    | Y            | N            | Y            | N
002 | Library | Y            | N            | Y            | N
003 | Museum  | Y            | N            | Y            | N
Where:
Item1_10....ItemN_10 is a field where a value of 'Y' (for Yes) is assigned if, at a particular location, they sell that food item only 10 months of the year. Otherwise, the value is 'N' for No.
Item1_02....ItemN_02 is a field where a value of 'Y' is assigned if, at a particular location, they sell that food item only 02 months of the year. Otherwise, the value is 'N' for No.
I want to add columns to Table #1, and have it look as follows:
Desired new table (Table #2):
Row | PlaceID | FoodItem1_10 | FoodItem1_02 | FoodItem2_10 | FoodItem2_02 | AnyItems_10months | AnyItems_02months
----|---------|--------------|--------------|--------------|--------------|-------------------|------------------
001 | Park    | Y            | N            | Y            | N            | Y                 | N
002 | Library | Y            | Y            | Y            | N            | Y                 | Y
003 | Museum  | Y            | N            | Y            | N            | Y                 | N
Where:
AnyItems_10months is a field that captures whether or not a particular place sells any items for a 10-month period. It takes the value 'Y' when the place has a 'Y' in any of the columns Item1_10 ... ItemN_10.
AnyItems_02months is a field that captures whether or not a particular place sells any items for a 02-month period. It takes the value 'Y' when the place has a 'Y' in any of the columns Item1_02 ... ItemN_02.
What I have been trying:
Since my columns follow a particular naming pattern, I thought it would be best to use a wildcard to generate my two new columns.
Obstacle: Access does not accept my expression.
Why don't you just hard-code it into a query? You're not going to be able to make a field like that in a table without reading the .Fields property of the table. It would get really messy. If you're always going to do it the same way, doing it in a query is going to be the easiest way.
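For what it's worth, here is a minimal sketch of that hard-coded query, assuming the table is saved as Table1 and only the two food items shown exist (each additional ItemN_10 / ItemN_02 column just adds another OR term):

SELECT t.*,
       IIf(t.FoodItem1_10 = "Y" OR t.FoodItem2_10 = "Y", "Y", "N") AS AnyItems_10months,
       IIf(t.FoodItem1_02 = "Y" OR t.FoodItem2_02 = "Y", "Y", "N") AS AnyItems_02months
FROM Table1 AS t;

Saving this as a query gives you the two Any* columns whenever you open it, without physically adding fields to the table.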

How do I remove contiguous sequences of almost identical records from database

I have a SQL Server database containing real-time stock quotes.
There is a Quotes table containing what you would expect-- a sequence number, ticker symbol, time, price, bid, bid size, ask, ask size, etc.
The sequence number corresponds to a message that was received containing data for a set of ticker symbols being tracked. A new message (with a new, incrementing sequence number) is received whenever anything changes for any of the symbols being tracked. The message contains data for all symbols (even for those where nothing changed).
When the data was put into the database, a record was inserted for every symbol in each message, even for symbols where nothing changed since the prior message. So a lot of records contain redundant information (only the sequence number changed) and I want to remove these redundant records.
This is not the same as removing all but one record from the entire database for a combination of identical columns (already answered). Rather, I want to compress each contiguous block of identical records (identical except for sequence number) into a single record. When finished, there may be duplicate records but with differing records between them.
My approach was to find contiguous ranges of records (for a ticker symbol) where everything is the same except the sequence number.
In the following sample data I simplify things by showing only Sequence, Symbol, and Price. The compound primary key would be Sequence+Symbol (each symbol appears only once in a message). I want to remove records where Price is the same as the prior record (for a given ticker symbol). For ticker X it means I want to remove the range [1, 6], and for ticker Y I want to remove the ranges [1, 2], [4, 5] and [7, 7]:
Before:
Sequence Symbol Price
0 X $10
0 Y $ 5
1 X $10
1 Y $ 5
2 X $10
2 Y $ 5
3 X $10
3 Y $ 6
4 X $10
4 Y $ 6
5 X $10
5 Y $ 6
6 X $10
6 Y $ 5
7 X $11
7 Y $ 5
After:
Sequence Symbol Price
0 X $10
0 Y $ 5
3 Y $ 6
6 Y $ 5
7 X $11
Note that (Y, $5) appears twice but with (Y, $6) between.
The following generates the ranges I need. The left outer join ensures I select the first group of records (where there is no earlier record that is different), and the BETWEEN is intended to reduce the number of records that need to be searched to find the next-earlier different record (the results are the same without the BETWEEN, but slower). I would need only to add something like "DELETE FROM Quotes WHERE Sequence BETWEEN StartOfRange AND EndOfRange".
SELECT
GroupsOfIdenticalRecords.Symbol,
MIN(GroupsOfIdenticalRecords.Sequence)+1 AS StartOfRange,
MAX(GroupsOfIdenticalRecords.Sequence) AS EndOfRange
FROM
(
SELECT
Q1.Symbol,
Q1.Sequence,
MAX(Q2.Sequence) AS ClosestEarlierDifferentRecord
FROM
Quotes AS Q1
LEFT OUTER JOIN
Quotes AS Q2
ON
Q2.Sequence BETWEEN Q1.Sequence-100 AND Q1.Sequence-1
AND Q2.Symbol=Q1.Symbol
AND Q2.Price<>Q1.Price
GROUP BY
Q1.Sequence,
Q1.Symbol
) AS GroupsOfIdenticalRecords
GROUP BY
GroupsOfIdenticalRecords.Symbol,
GroupsOfIdenticalRecords.ClosestEarlierDifferentRecord
The problem is that this is way too slow and runs out of memory (crashing SSMS, remarkably) for the 2+ million records in the database. Even if I change "-100" to "-2" it is still slow and runs out of memory. I expected the "ON" clause of the LEFT OUTER JOIN to limit the processing and memory usage (2 million iterations, processing about 100 records each, which should be tractable), but it seems like SQL Server may first be generating all combinations of the 2 instances of the table, Q1 and Q2 (about 4e12 combinations), before selecting based on the criteria specified in the ON clause.
If I run the query on a smaller subset of the data (for example, by using "(SELECT TOP 100000 * FROM Quotes) AS Q1", and similar for Q2), it completes in a reasonable amount of time. I was trying to figure out how to automatically run this 20 or so times using "WHERE Sequence BETWEEN 0 AND 99999", then "...BETWEEN 100000 AND 199999", etc. (actually I would use overlapping ranges such as [0, 99999], [99900, 199999], etc. to remove ranges that span boundaries).
The following generates sets of ranges to split the data into 100000-record blocks ([0, 99999], [100000, 199999], etc.). But how do I apply the above query repeatedly (once for each range)? I keep getting stuck because you can't group these using "BETWEEN" without applying an aggregate function. So instead of selecting blocks of records, I only know how to get MIN(), MAX(), etc. (single values), which does not work with the above query (as Q1 and Q2). Is there a way to do this? Is there a totally different (and better) approach to the problem?
SELECT
CONVERT(INTEGER, Sequence / 100000)*100000 AS BlockStart,
MIN(((1+CONVERT(INTEGER, Sequence / 100000))*100000)-1) AS BlockEnd
FROM
Quotes
GROUP BY
CONVERT(INTEGER, Sequence / 100000)*100000
You can do this with a nice little trick. The groups that you want can be defined as the difference between two sequences of numbers. One is assigned for each symbol, in order by sequence. The other is assigned for each symbol and price. This is what it looks like for your data:
Sequence Symbol Price seq1 seq2 diff
0 X $10 1 1 0
0 Y $ 5 1 1 0
1 X $10 2 2 0
1 Y $ 5 2 2 0
2 X $10 3 3 0
2 Y $ 5 3 3 0
3 X $10 4 4 0
3 Y $ 6 4 1 3
4 X $10 5 5 0
4 Y $ 6 5 2 3
5 X $10 6 6 0
5 Y $ 6 6 3 3
6 X $10 7 7 0
6 Y $ 5 7 4 3
7 X $11 8 1 7
7 Y $ 5 8 5 3
You can stare at this and figure out that the combination of symbol, diff, and price defines each group.
The following puts this into a SQL query to return the data you want:
select min(q.sequence) as sequence, symbol, price
from (select q.*,
             (row_number() over (partition by symbol order by sequence) -
              row_number() over (partition by symbol, price order by sequence)
             ) as grp
      from quotes q
     ) q
group by symbol, grp, price;
If you want to replace the data in the original table, I would suggest that you store the results of the query in a temporary table, truncate the original table, and then re-insert the values from the temporary table.
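As a rough sketch of that replace step (assuming the table is dbo.Quotes and, for brevity, only the three columns shown; in practice every column would be listed in the SELECT and the INSERT):

-- keep only the first row of each contiguous (Symbol, Price) block
SELECT MIN(q.Sequence) AS Sequence, q.Symbol, q.Price
INTO #Keep
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY Symbol ORDER BY Sequence) -
             ROW_NUMBER() OVER (PARTITION BY Symbol, Price ORDER BY Sequence) AS grp
      FROM dbo.Quotes) AS q
GROUP BY q.Symbol, q.Price, q.grp;

TRUNCATE TABLE dbo.Quotes;

INSERT INTO dbo.Quotes (Sequence, Symbol, Price)
SELECT Sequence, Symbol, Price
FROM #Keep;

DROP TABLE #Keep;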
Answering my own question. I want to add some additional comments to complement the excellent answer by Gordon Linoff.
You're right. It is a nice little trick. I had to stare at it for a while to understand how it works. Here are my thoughts, for the benefit of others.
The numbering by Sequence/Symbol (seq1) always increases, whereas the numbering by Symbol/Price (seq2) only increases sometimes (within each group, only when a record for that Symbol contains the group's Price). Therefore seq1 either remains in lock step with seq2 (i.e., diff remains constant until either Symbol or Price changes), or seq1 "runs away" from seq2 while it is busy "counting" rows with other Prices for that Symbol -- which increases the difference between seq1 and seq2 for the given Symbol and Price. Once seq2 falls behind, it can never "catch up" to seq1, so a given value of diff is never seen again once diff moves to the next larger value (for a given Price). By taking the minimum value within each Symbol/Price group, you get the first record in each contiguous block, which is exactly what I needed.
I don't use SQL a lot, so I wasn't familiar with the OVER clause. I just took it on faith that the first clause generates seq1 and the second generates seq2. I can kind of see how it works, but that's not the interesting part.
My data contained more than just Price. It was a simple thing to add the other fields (Bid, Ask, etc.) to the second OVER clause and the final GROUP BY:
row_number() over (partition by Symbol, Price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time order by Sequence)
group by Symbol, grp, price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time
Also, I was able to use > MIN(...) and <= MAX(...) to define ranges of records to delete.