How do you mark unique occurrences in a pattern given that values are unique when occurring simultaneously and not when they come separately? [closed] - sql

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
Suppose my data looks like this
student article.bought
1 A pen
2 B pencil
3 V book
4 A pen
5 A inkbottle
6 B pen
7 B pencil
8 B pencil
9 V book
10 Z marker
11 A inkbottle
12 V book
13 V pen
14 V book
I need unique occurrences of articles probably in a different column like this
student article.bought Occurences
1 A pen 1
2 B pencil 1
3 V book 1
4 A pen 1 # as A is taking a pen again
5 A inkbottle 2 # 'A' changed from pen to ink bottle
6 B pen 2
7 B pencil 3 # though B took pencil before, this is different as he took a pen in between
8 B pencil 3
9 V book 1
10 Z marker 1
11 A inkbottle 2
12 V book 1
13 V pen 2
14 V book 3

In R, we can find changes in a student's selection by taking the difference, diff, between consecutive values. The cumulative sum, cumsum, of the resulting logical vector gives a running count of occurrences.
In the second line we coerce the factor variable article.bought to numeric and use ave to apply the function f from the first line within each student.
f <- function(x) cumsum(c(FALSE, diff(x) != 0)) + 1
df$Occurences <- with(df, ave(as.numeric(article.bought), student, FUN=f))
df
# student article.bought Occurences
# 1 A pen 1
# 2 B pencil 1
# 3 V book 1
# 4 A pen 1
# 5 A inkbottle 2
# 6 B pen 2
# 7 B pencil 3
# 8 B pencil 3
# 9 V book 1
# 10 Z marker 1
# 11 A inkbottle 2
# 12 V book 1
# 13 V pen 2
# 14 V book 3
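For readers who do not use R, the same run-counting idea can be sketched in plain Python. This is only an illustration under the assumption that the rows are processed in their original order; the function name mark_occurrences is hypothetical and not part of the answer above.

```python
def mark_occurrences(rows):
    """Number each run of identical consecutive articles, separately per student."""
    last_article = {}  # student -> article on that student's previous row
    counter = {}       # student -> current run number
    out = []
    for student, article in rows:
        if student not in counter:
            counter[student] = 1          # first row for this student
        elif article != last_article[student]:
            counter[student] += 1         # the student switched articles
        last_article[student] = article
        out.append(counter[student])
    return out


rows = [("A", "pen"), ("B", "pencil"), ("V", "book"), ("A", "pen"),
        ("A", "inkbottle"), ("B", "pen"), ("B", "pencil"), ("B", "pencil"),
        ("V", "book"), ("Z", "marker"), ("A", "inkbottle"), ("V", "book"),
        ("V", "pen"), ("V", "book")]
occurrences = mark_occurrences(rows)
print(occurrences)
```

The counter only increases when a student's article differs from that student's previous row, which is the same comparison that diff(x) != 0 performs inside each ave group.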

Create an additional column [Original Sort Order] and enumerate from 1 to ...
Sort the table by student / original sort order
Enter =IF(A2=A1,IF(B2=B1,D1,D1+1),1) in D2 and copy down
Convert column D to values (copy, paste as ... Values)
Restore the original sort order
If this is more than a one-off, use the same tactic to create a VBA script

A shot with SAS:
data try00;
length student article $20;
infile datalines dlm=' ';
input student $ article $;
datalines;
A pen
B pencil
V book
A pen
A inkbottle
B pen
B pencil
B pencil
V book
Z marker
A inkbottle
V book
V pen
V book
;
data try01;
set try00;
pos=_n_;
run;
proc sort data=try01 out=try02; by student pos article; run;
proc sort data=try02 out=stud(keep=student) nodupkey; by student; run;
data shell;
length occurrence 8.;
set try02;
if _n_>0 then delete;
run;
%macro loopstudent();
data _null_; set stud end=eof; if eof then call symput("nstu",_n_); run;
%do i=1 %to &nstu;
data _null_; set stud; if _n_=&i then call symput("stud&i",student); run;
data thisstu;
set try02;
where student="&&stud&i";
dummyart=lag(article);
retain occurrence 0;
if dummyart ne article then occurrence=occurrence+1;
else occurrence=occurrence;
drop dummyart;
run;
proc append base=shell data=thisstu; run;
%end;
proc sort data=shell out=final; by pos; run;
%mend loopstudent; %loopstudent();
The dataset "final" has the result.

Related

Replace value in column based on value in another column

I have a dataframe with 3240 rows and 3 columns. Column Block gives the block in which the values in columns A and B appeared. There are 6 unique blocks, and they repeat in sequence from 1-6 throughout the whole dataframe. Values in column A repeat in the exact order 1-10 within every block. Column B holds the values a-j (n = 10) in random order within each block, so they are never duplicated within a block.
So in each of the 6 blocks, the values in column A (1-10) appear in exact order, while the values in column B (a-j) appear in random order.
Df looks like this:
Block A B ID
1 1 a XY
1 2 b XY
1 3 c XY
1 4 d XY
1 5 e XY
1 6 f XY
1 7 g XY
1 8 h XY
1 9 i XY
1 10 j XY
....
6 1 d XY
...
6 6 j XY
....
1 1 g XX
1 2 a XX
Throughout the dataframe I would like to replace all values in column B based on the corresponding value in column A, separately for each block. The rule is to swap the B values between rows following the pattern 1=6, 2=7, 3=8, 4=9, 5=10: the row with A = 1 takes the B value of the row with A = 6 and vice versa, and so on.
Result would look like this:
Block A B ID
1 1 f XY
1 2 g XY
1 3 h XY
1 4 i XY
1 5 j XY
1 6 a XY
1 7 b XY
1 8 c XY
1 9 d XY
1 10 e XY
....
6 1 j XY
...
6 6 d XY
....
1 1 g XX
1 2 a XX
What would be an efficient way to do this?
You want to identify the block of 5 within each block of 10 and swap them. This is my solution:
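import numpy as np
# blk_10 numbers each consecutive run of 10 rows; blk_5 is 1 for the first
# half of a run and 0 for the second half, so sorting moves the second half up.
# Block is left out of the sort keys: its values 1-6 repeat down the frame,
# so sorting on it would move rows between the repeats.
df['B'] = (df.assign(blk_5 = (np.arange(len(df)) // 5 + 1) % 2,
                     blk_10 = np.arange(len(df)) // 10)
             .sort_values(['blk_10', 'blk_5'])
             ['B'].values
          )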
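The swap can also be checked without pandas. The sketch below assumes the rows are already ordered so that every consecutive run of 10 rows is one block with A running 1-10; swap_halves is a hypothetical helper name used only for illustration.

```python
def swap_halves(values, block=10):
    """Swap the first and second half of every consecutive run of `block` items."""
    half = block // 2
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        out.extend(chunk[half:] + chunk[:half])  # second half first, then first half
    return out


# Two blocks of column B: one in order, one shuffled.
b = list("abcdefghij") + list("gabcdefhij")
swapped = swap_halves(b)
print("".join(swapped[:10]))
print("".join(swapped[10:]))
```

Because the swap is purely positional, it never needs to look at column A, which is why the block and half-block counters are enough.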

SAS sum observations not in a group, by multiple groups

This post follows this one: SAS sum observations not in a group, by group
My minimal example there was sadly a bit too minimal, and I wasn't able to apply it to my data.
Here is a complete case example, what I have is :
data have;
input group1 group2 $ group3 $ value;
datalines;
1 A X 2
1 A X 4
1 A Y 1
1 A Y 3
1 B Z 2
1 B Z 1
1 C Y 1
1 C Y 6
1 C Z 7
2 A Z 3
2 A Z 9
2 A Y 2
2 B X 8
2 B X 5
2 B X 5
2 B Z 7
2 C Y 2
2 C X 1
;
run;
For each group, I want a new variable "sum" with the sum of all values in the column for the same sub groups (group1 and group2), except for the group (group3) the observation is in.
data want;
input group1 group2 $ group3 $ value sum;
datalines;
1 A X 2 8
1 A X 4 6
1 A Y 1 9
1 A Y 3 7
1 B Z 2 1
1 B Z 1 2
1 C Y 1 13
1 C Y 6 8
1 C Z 7 7
2 A Z 3 11
2 A Z 9 5
2 A Y 2 12
2 B X 8 17
2 B X 5 20
2 B X 5 20
2 B Z 7 18
2 C Y 2 1
2 C X 1 2
;
run;
My goal is to use either data steps or proc sql (I am doing this on around 30 million observations, and proc means and similar procedures in SAS seemed slower than those on previous similar computations).
My issue with the solutions provided in the linked post is that they use the total value of the column, and I don't know how to change this to use the total in the sub group.
Any idea please?
A SQL solution will join all data to an aggregating select:
proc sql;
create table want as
select have.group1, have.group2, have.group3, have.value
, aggregate.sum - value as sum
from
have
join
(select group1, group2, sum(value) as sum
from have
group by group1, group2
) aggregate
on
aggregate.group1 = have.group1
& aggregate.group2 = have.group2
;
quit;
SQL can be slower than a hash solution, but more people understand SQL code than understand a SAS DATA step involving hashes (which can be faster than SQL).
data want2;
if 0 then set have; * prep pdv;
declare hash sums (suminc:'value');
sums.defineKey('group1', 'group2');
sums.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
sums.ref(); * adds value to internal sum of hash data record;
end;
do while (not last_have);
set have end=last_have;
sums.sum(sum:sum); * retrieve group sum.;
sum = sum - value; * subtract from group sum;
output;
end;
stop;
run;
SAS documentation touches on SUMINC and has some examples
The question does not address this concept:
For each row compute the tier 2 sum that excludes the tier 3 this row is in
A hash based solution would require tracking each two level and three level sums:
data want2;
if 0 then set have; * prep pdv;
declare hash T2 (suminc:'value'); * hash for two (T)iers;
T2.defineKey('group1', 'group2'); * one hash record per combination of group1, group2;
T2.defineDone();
declare hash T3 (suminc:'value'); * hash for three (T)iers;
T3.defineKey('group1', 'group2', 'group3'); * one hash record per combination of group1, group2, group3;
T3.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
T2.ref(); * adds value to internal sum of hash data record;
T3.ref();
end;
T2_cardinality = T2.num_items;
T3_cardinality = T3.num_items;
put 'NOTE: |T2| = ' T2_cardinality;
put 'NOTE: |T3| = ' T3_cardinality;
do while (not last_have);
set have end=last_have;
T2.sum(sum:t2_sum);
T3.sum(sum:t3_sum);
sum = t2_sum - t3_sum;
output;
end;
stop;
drop t2_: t3:;
run;
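The two readings of the requirement give different numbers, which a small Python sketch makes visible. It is only an illustration: the rows are copied from the have data set above, and the names t2, t3, minus_own_value and minus_own_group3 are hypothetical.

```python
from collections import defaultdict

# Rows of (group1, group2, group3, value), copied from the `have` data set.
have = [(1, "A", "X", 2), (1, "A", "X", 4), (1, "A", "Y", 1), (1, "A", "Y", 3),
        (1, "B", "Z", 2), (1, "B", "Z", 1), (1, "C", "Y", 1), (1, "C", "Y", 6),
        (1, "C", "Z", 7), (2, "A", "Z", 3), (2, "A", "Z", 9), (2, "A", "Y", 2),
        (2, "B", "X", 8), (2, "B", "X", 5), (2, "B", "X", 5), (2, "B", "Z", 7),
        (2, "C", "Y", 2), (2, "C", "X", 1)]

# First pass: accumulate the two-tier and three-tier sums.
t2 = defaultdict(int)
t3 = defaultdict(int)
for g1, g2, g3, value in have:
    t2[(g1, g2)] += value
    t3[(g1, g2, g3)] += value

# Second pass, reading 1: sub group total minus the row's own value.
minus_own_value = [t2[(g1, g2)] - value for g1, g2, g3, value in have]

# Second pass, reading 2: sub group total minus the row's whole group3 total.
minus_own_group3 = [t2[(g1, g2)] - t3[(g1, g2, g3)] for g1, g2, g3, value in have]

print(minus_own_value)
print(minus_own_group3)
```

The first list reproduces the sum column of the want table in the question, while the second follows the wording "except for the group (group3) the observation is in".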

SAS sum observations not in a group, by group

I have a data set :
data have;
input group $ value;
datalines;
A 4
A 3
A 2
A 1
B 1
C 1
D 2
D 1
E 1
F 1
G 2
G 1
H 1
;
run;
The first variable is a group identifier, the second a value.
For each group, I want a new variable "sum" with the sum of all values in the column, except for the group the observation is in.
My issue is having to do that on nearly 30 million observations, so efficiency matters.
I found that using a data step was more efficient than using procs.
The final database should look like:
data want;
input group $ value sum;
datalines;
A 4 11
A 3 11
A 2 11
A 1 11
B 1 20
C 1 20
D 2 18
D 1 18
E 1 20
F 1 20
G 2 18
G 1 18
H 1 20
;
run;
Any idea how to perform this please?
Edit: I don't know if this matters, but the example I gave is a simplified version of my issue. In the real case, I have 2 other group variables, so taking the sum of the whole column and subtracting the sum in the group is not a viable solution.
The requirement
sum of all values in the column, except for the group the observation is in
indicates two passes of the data must occur:
Compute allsum and each group's group_sum. A hash can store each group's sum, computed via a specified suminc: variable and .ref() method invocation. A variable can accumulate allsum.
Compute allsum - group_sum for each row of a group. The group_sum is retrieved from the hash and subtracted from allsum.
Example:
data want;
if 0 then set have; * prep pdv;
declare hash sums (suminc:'value');
sums.defineKey('group');
sums.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
sums.ref(); * adds value to internal sum of hash data record;
allsum + value;
end;
do while (not last_have);
set have end=last_have;
sums.sum(sum:sum); * retrieve groups sum. Do you hear the Dragnet theme too?;
sum = allsum - sum; * subtract from allsum;
output;
end;
stop;
run;
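The same two-pass logic, sketched in Python with a dictionary standing in for the hash object. This is an illustration only; the rows are copied from the have data set in the question, and the Python names mirror the SAS variables.

```python
from collections import defaultdict

# Rows of (group, value), copied from the `have` data set.
have = [("A", 4), ("A", 3), ("A", 2), ("A", 1), ("B", 1), ("C", 1), ("D", 2),
        ("D", 1), ("E", 1), ("F", 1), ("G", 2), ("G", 1), ("H", 1)]

# Pass 1: accumulate each group's sum and the grand total.
group_sum = defaultdict(int)
allsum = 0
for group, value in have:
    group_sum[group] += value
    allsum += value

# Pass 2: for every row, subtract its group's sum from the grand total.
want = [(group, value, allsum - group_sum[group]) for group, value in have]
for row in want:
    print(row)
```

Both passes are linear in the number of rows, and the dictionary holds one entry per group, so memory grows with the number of groups rather than the number of observations.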
What is wrong with a straightforward approach? You need to make two passes no matter what you do.
Like this. I included extra variables so you can see how the values are derived.
proc sql ;
create table want as
select a.*,b.grand,sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group
;
quit;
Results:
Obs group value grand total sum
1 A 3 21 10 11
2 A 1 21 10 11
3 A 2 21 10 11
4 A 4 21 10 11
5 B 1 21 1 20
6 C 1 21 1 20
7 D 2 21 3 18
8 D 1 21 3 18
9 E 1 21 1 20
10 F 1 21 1 20
11 G 1 21 3 18
12 G 2 21 3 18
13 H 1 21 1 20
Note it does not matter what you have as your GROUP BY clause.
Do you really need to output all of the original observations? Why not just output the summary table?
proc sql ;
create table want as
select a.group, sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group
;
quit;
Results
Obs group total sum
1 A 10 11
2 B 1 20
3 C 1 20
4 D 3 18
5 E 1 20
6 F 1 20
7 G 3 18
8 H 1 20
I would break this out into two different segments:
1.) You could start by using PROC SQL to get the sums by the group
2.) Then use some IF/THEN statements to reassign the values by group

How to do a last observation carrying forward using SAS PROC SQL

I have the data below. I want to write SAS proc sql code to get the last non-missing value for each patient (ptno).
data sda;
input ptno visit weight;
format ptno z3. ;
cards;
1 1 122
1 2 123
1 3 .
1 4 .
2 1 156
2 2 .
2 3 70
2 4 .
3 1 60
3 2 .
3 3 112
3 4 .
;
run;
proc sql noprint;
create table new as
select ptno,visit,weight,
case
when weight = . then weight
else .
end as _weight_1
from sda
group by ptno,visit
order by ptno,visit;
quit;
The SQL code above does not work well.
The desired output data looks like this:
ptno visit weight
1 1 122
1 2 123
1 3 123
1 4 123
2 1 156
2 2 .
2 3 70
2 4 70
3 1 60
3 2 .
3 3 112
3 4 112
Since you do have effectively a row number (visit), you can do this - though it's much slower than the data step.
Here it is, broken out into a separate column for demonstration purposes - of course in your case you will want to coalesce this into one column.
Basically, you need a subquery that determines the maximum visit number less than the current one that does have a legitimate weight count, and then join that to the table to get the weight.
proc sql;
select ptno, visit, weight,
(
select weight
from sda A,
(select ptno, max(visit) as visit
from sda D
where D.ptno=S.ptno
and D.visit<S.visit
and D.weight is not null
group by ptno
) V
where A.visit=V.visit and A.ptno=V.ptno
)
from sda S
;
quit;
Although you don't describe it that way, you do not carry forward VISIT 1, right?
I don't know why you would want to do this using SQL. In SAS a data step is much better suited to the task. I like using the "update trick". If you're interested in how this works I will leave it to you to study the UPDATE statement.
data locf;
update sda(obs=0 keep=ptno) sda;
by ptno;
output;
if visit eq 1 then call missing(weight);
run;
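For comparison, here is a plain Python sketch of the same carry-forward rule, including the detail that a value from visit 1 is not carried forward. It assumes the rows are sorted by ptno and visit and uses None for a missing weight; the function name locf is hypothetical.

```python
def locf(rows):
    """Carry the last non-missing weight forward within a patient, except from visit 1."""
    out = []
    current = None   # patient currently being processed
    carried = None   # value available to fill the next missing weight
    for ptno, visit, weight in rows:
        if ptno != current:
            current, carried = ptno, None   # never carry across patients
        if weight is None:
            weight = carried
        out.append((ptno, visit, weight))
        # A visit 1 value is written out but not made available to later visits.
        carried = None if visit == 1 else weight
    return out


sda = [(1, 1, 122), (1, 2, 123), (1, 3, None), (1, 4, None),
       (2, 1, 156), (2, 2, None), (2, 3, 70), (2, 4, None),
       (3, 1, 60), (3, 2, None), (3, 3, 112), (3, 4, None)]
filled = locf(sda)
for row in filled:
    print(row)
```

Resetting the carried value after visit 1 plays the role of the call missing(weight) statement that follows output in the data step.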

SAS Proc Optmodel Constraint Syntax

I have an optimization exercise I am trying to work through and am stuck again on the syntax. Below is my attempt, and I'd really like a thorough explanation of the syntax in addition to the solution code. I think it's the specific index piece that I am having trouble with.
The problem:
I have an item that I wish to sell out of within ten weeks. I have a historical trend and wish to alter that trend by lowering price. I want maximum margin dollars. The below works, but I wish to add two constraints and can't sort out the syntax. I have spaces for these two constraints in the code, with my brief explanation of what I think they may look like. Here is a more detailed explanation of what I need each constraint to do.
inv_cap=There is only so much inventory available at each location. I wish to sell it all. For location 1 it is 800, location 2 it is 1200. The sum of the column FRC_UNITS should equal this amount, but cannot exceed it.
price_down_or_same=The price cannot bounce around, so each week's price must be less than or equal to the previous week's price. So, price(i)<=price(i-1) where i=week.
Here is my attempt. Thank you in advance for assistance.
*read in data;
data opt_test_mkdown_raw;
input
ITM_NBR
ITM_DES_TXT $
LCT_NBR
WEEK
LY_UNITS
ELAST
COST
PRICE
TOTAL_INV;
cards;
1 stuff 1 1 300 1.2 6 10 800
1 stuff 1 2 150 1.2 6 10 800
1 stuff 1 3 100 1.2 6 10 800
1 stuff 1 4 60 1.2 6 10 800
1 stuff 1 5 40 1.2 6 10 800
1 stuff 1 6 20 1.2 6 10 800
1 stuff 1 7 10 1.2 6 10 800
1 stuff 1 8 10 1.2 6 10 800
1 stuff 1 9 5 1.2 6 10 800
1 stuff 1 10 1 1.2 6 10 800
1 stuff 2 1 400 1.1 6 9 1200
1 stuff 2 2 200 1.1 6 9 1200
1 stuff 2 3 100 1.1 6 9 1200
1 stuff 2 4 100 1.1 6 9 1200
1 stuff 2 5 100 1.1 6 9 1200
1 stuff 2 6 50 1.1 6 9 1200
1 stuff 2 7 20 1.1 6 9 1200
1 stuff 2 8 20 1.1 6 9 1200
1 stuff 2 9 5 1.1 6 9 1200
1 stuff 2 10 3 1.1 6 9 1200
;
run;
data opt_test_mkdown_raw;
set opt_test_mkdown_raw;
ITM_LCT_WK=cats(ITM_NBR, LCT_NBR, WEEK);
ITM_LCT=cats(ITM_NBR, LCT_NBR);
run;
proc optmodel;
*set variables and inputs;
set<string> ITM_LCT_WK;
number ITM_NBR{ITM_LCT_WK};
string ITM_DES_TXT{ITM_LCT_WK};
string ITM_LCT{ITM_LCT_WK};
number LCT_NBR{ITM_LCT_WK};
number WEEK{ITM_LCT_WK};
number LY_UNITS{ITM_LCT_WK};
number ELAST{ITM_LCT_WK};
number COST{ITM_LCT_WK};
number PRICE{ITM_LCT_WK};
number TOTAL_INV{ITM_LCT_WK};
*read data into procedure;
read data opt_test_mkdown_raw into
ITM_LCT_WK=[ITM_LCT_WK]
ITM_NBR
ITM_DES_TXT
ITM_LCT
LCT_NBR
WEEK
LY_UNITS
ELAST
COST
PRICE
TOTAL_INV;
var NEW_PRICE{i in ITM_LCT_WK};
impvar FRC_UNITS{i in ITM_LCT_WK}=(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i])*LY_UNITS[i];
con ceiling_price {i in ITM_LCT_WK}: NEW_PRICE[i]<=PRICE[i];
/*con inv_cap {j in ITM_LCT}: sum{i in ITM_LCT_WK}=I want this to be 800 for location 1 and 1200 for location 2;*/
con supply_last {i in ITM_LCT_WK}: FRC_UNITS[i]>=LY_UNITS[i];
/*con price_down_or_same {j in ITM_LCT} : NEW_PRICE[week]<=NEW_PRICE[week-1];*/
*state function to optimize;
max margin=sum{i in ITM_LCT_WK}
(NEW_PRICE[i]-COST[i])*(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i])*LY_UNITS[i];
/*expand;*/
solve;
*write output dataset;
create data results_MKD_maxmargin
from
[ITM_LCT_WK]={ITM_LCT_WK}
ITM_NBR
ITM_DES_TXT
LCT_NBR
WEEK
LY_UNITS
FRC_UNITS
ELAST
COST
PRICE
NEW_PRICE
TOTAL_INV;
*write results to window;
print
/*NEW_PRICE */
margin;
quit;
The main difficulty is that in your application, decisions are indexed by (Item,Location) pairs and Weeks, but in your code you have merged (Item,Location,Week) triplets. I rather like that use of the data step, but the result in this example is that your code is unable to refer to specific weeks and to specific pairs.
The fix that changes your code the least is to add these relationships by using defined sets and inputs that OPTMODEL can compute for you. Then you will know which triplets refer to each combination of (Item,Location) pair and week:
/* This code creates a set version of the Item x Location pairs
that you already have as strings */
set ITM_LCTS = setof{ilw in ITM_LCT_WK} itm_lct[ilw];
/* For each Item x Location pair, define a set of which
Item x Location x Week entries refer to that Item x Location */
set ILWperIL{il in ITM_LCTS} = {ilw in ITM_LCT_WK: itm_lct[ilw] = il};
With this relationship you can add the other two constraints.
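To see what these two sets contain, here is a rough Python analogue. It is only a sketch: the keys are rebuilt the way cats(ITM_NBR, LCT_NBR, WEEK) builds them in the data step, and week_pairs is a hypothetical name for the (this week, previous week) key pairs that a price-ordering constraint would compare.

```python
# One item, two locations, ten weeks, as in the example data.
rows = [(1, loc, week) for loc in (1, 2) for week in range(1, 11)]

# itm_lct maps each Item x Location x Week key to its Item x Location key.
itm_lct = {f"{i}{l}{w}": f"{i}{l}" for i, l, w in rows}

# Analogue of: set ITM_LCTS = setof{ilw in ITM_LCT_WK} itm_lct[ilw];
ITM_LCTS = set(itm_lct.values())

# Analogue of: set ILWperIL{il in ITM_LCTS} = {ilw in ITM_LCT_WK: itm_lct[ilw] = il};
ILWperIL = {il: {ilw for ilw, pair in itm_lct.items() if pair == il}
            for il in ITM_LCTS}

# Key pairs built by concatenating the pair key with the week number.
last_week = max(w for _, _, w in rows)
week_pairs = [(f"{il}{w}", f"{il}{w - 1}")
              for il in sorted(ITM_LCTS) for w in range(2, last_week + 1)]

print(sorted(ITM_LCTS))
print(week_pairs[:3])
```

Each pair maps to exactly ten triplet keys, so a sum over ILWperIL[il] covers one location's ten weeks and nothing else.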
I left your code as is, but applied to the new code a convention I find useful, especially when there are similar names like itm_lct and ITM_LCTS:
sets as all caps;
input parameters start with lowercase;
outputs (vars, impvars, and constraints) start with Uppercase
Here is the new OPTMODEL code:
proc optmodel;
*set variables and inputs;
set<string> ITM_LCT_WK;
number ITM_NBR{ITM_LCT_WK};
string ITM_DES_TXT{ITM_LCT_WK};
string ITM_LCT{ITM_LCT_WK};
number LCT_NBR{ITM_LCT_WK};
number WEEK{ITM_LCT_WK};
number LY_UNITS{ITM_LCT_WK};
number ELAST{ITM_LCT_WK};
number COST{ITM_LCT_WK};
number PRICE{ITM_LCT_WK};
number TOTAL_INV{ITM_LCT_WK};
*read data into procedure;
read data opt_test_mkdown_raw into
ITM_LCT_WK=[ITM_LCT_WK]
ITM_NBR
ITM_DES_TXT
ITM_LCT
LCT_NBR
WEEK
LY_UNITS
ELAST
COST
PRICE
TOTAL_INV;
var NEW_PRICE{i in ITM_LCT_WK} <= price[i];
impvar FRC_UNITS{i in ITM_LCT_WK} =
(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i]) * LY_UNITS[i];
/* Moved to bound:
con ceiling_price {i in ITM_LCT_WK}: NEW_PRICE[i] <= PRICE[i]; */
con supply_last{i in ITM_LCT_WK}: FRC_UNITS[i] >= LY_UNITS[i];
/* This code creates a set version of the Item x Location pairs
that you already have as strings */
set ITM_LCTS = setof{ilw in ITM_LCT_WK} itm_lct[ilw];
/* For each Item x Location pair, define a set of which
Item x Location x Week entries refer to that Item x Location */
set ILWperIL{il in ITM_LCTS} = {ilw in ITM_LCT_WK: itm_lct[ilw] = il};
/* I assume that for each item and location
the inventory is the same for all weeks for convenience,
i.e., that is not a coincidence */
num inventory{il in ITM_LCTS} = max{ilw in ILWperIL[il]} total_inv[ilw];
con inv_cap {il in ITM_LCTS}:
sum{ilw in ILWperIL[il]} Frc_Units[ilw] = inventory[il];
num lastWeek = max{ilw in ITM_LCT_WK} week[ilw];
/* Concatenating indexes is not the prettiest, but gets the job done here*/
con Price_down_or_same {il in ITM_LCTS, w in 2 .. lastWeek}:
New_Price[il || w] <= New_Price[il || w - 1];*/
*state function to optimize;
max margin=sum{i in ITM_LCT_WK}
(NEW_PRICE[i]-COST[i])*(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i])*LY_UNITS[i];
expand;
solve;
*write output dataset;
create data results_MKD_maxmargin
from
[ITM_LCT_WK]={ITM_LCT_WK}
ITM_NBR
ITM_DES_TXT
LCT_NBR
WEEK
LY_UNITS
FRC_UNITS
ELAST
COST
PRICE
NEW_PRICE
TOTAL_INV;
*write results to window;
print
NEW_PRICE FRC_UNITS
margin
;
quit;
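As a quick sanity check on the demand formula used in FRC_UNITS and in the objective, the arithmetic can be reproduced in a few lines of Python. The inputs below are week 1 at location 1 from the data set (price 10, elasticity 1.2, 300 units last year, cost 6); the new price of 9 is a hypothetical value chosen for illustration, not a solver result.

```python
def forecast_units(new_price, price, elast, ly_units):
    """Same expression as the FRC_UNITS impvar."""
    return (1 - (new_price - price) * elast / price) * ly_units


def margin(new_price, price, elast, ly_units, cost):
    """One term of the objective: unit margin times forecast units."""
    return (new_price - cost) * forecast_units(new_price, price, elast, ly_units)


units = forecast_units(9.0, 10.0, 1.2, 300)
dollars = margin(9.0, 10.0, 1.2, 300, 6.0)
print(round(units, 6), round(dollars, 6))
```

A 10% price cut with an elasticity of 1.2 lifts the forecast by 12%, and leaving the price unchanged returns last year's units, which is why the supply_last constraint can only be met by prices at or below the current price.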