Stata: Count if non-missing across row for certain range - sum

I am trying to count the number of times an event occurred before a particular age. I have data that shows the age at each event over a lifetime (age_event1-age_event3), as well as the age at which I am no longer interested in counting such events (stop_age). I would like to create a variable (sum_event) which counts the number of times the event of interest occurred prior to the stop age. An example:
ID  age_event1  age_event2  age_event3  stop_age  sum_event
 1          10          17          45        34          2
 2          23          31          32        54          3
 3          25          55           .        32          1
 4          21           .           .        22          1
How do I create the appropriate sum_event variable?

If you don't want to reshape your data, then you can loop over variables and count:
clear
set more off
*----- example data -----
input ///
ID age_event1 age_event2 age_event3 stop_age sum_event
1 10 17 45 34 2
2 23 31 32 54 3
3 25 55 . 32 1
4 21 . . 22 1
end
list
*----- what you want -----
gen sumevent2 = 0
foreach var of varlist age_event1 age_event2 age_event3 {
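    // missing values count as +infinity in Stata, so a missing age_event never satisfies the condition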
    replace sumevent2 = sumevent2 + (`var' < stop_age)
}
list
For numbered variables that follow some pattern, you can try something like:
<snip>
gen sumevent2 = 0
forvalues i = 1/3 {
    replace sumevent2 = sumevent2 + (age_event`i' < stop_age)
}
Another way with reshape:
*----- what you want -----
<snip>
reshape long age_event, i(ID) j(j)
bysort ID: egen sumevent2 = total(age_event < stop_age)
reshape wide // if you really need to go back to wide
list
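As an aside, if the same table lived in a pandas DataFrame, the row-wise count is a one-liner; a sketch (the DataFrame df below just reproduces the example data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'age_event1': [10, 23, 25, 21],
                   'age_event2': [17, 31, 55, np.nan],
                   'age_event3': [45, 32, np.nan, np.nan],
                   'stop_age': [34, 54, 32, 22]})

# NaN comparisons are False, so missing events never count, just as in Stata
event_cols = ['age_event1', 'age_event2', 'age_event3']
df['sum_event'] = df[event_cols].lt(df['stop_age'], axis=0).sum(axis=1)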

Related

iterrows() of 2 columns and save results in one column

In my data frame I want to iterrows() over two columns but save the result in one column. For example, df is:
 x    y
 5   10
30  445
70   32
expected output is:
points  sequence
     5         1
    10         2
    30         1
   445         2
I know about iterrows(), but it saves the output in two different columns. How can I get the expected output, and is there any way to generate the sequence number according to a condition? Any help will be appreciated.
First, never use iterrows, because it is really slow.
If you want a 1, 2 sequence across the columns, convert the values to a numpy array with DataFrame.to_numpy, flatten it with numpy.ravel, and then build the sequence with numpy.tile:
import numpy as np
import pandas as pd

df = pd.DataFrame({'points': df.to_numpy().ravel(),
                   'sequence': np.tile([1, 2], len(df))})
print(df)
   points  sequence
0       5         1
1      10         2
2      30         1
3     445         2
4      70         1
5      32         2
Alternatively, you can stick with iterrows by flattening each row and numbering its values, though this is much slower:
>>> pd.DataFrame([(val, seq) for _, row in df.iterrows()
...               for seq, val in enumerate(row, start=1)],
...              columns=['points', 'sequence'])
   points  sequence
0       5         1
1      10         2
2      30         1
3     445         2
4      70         1
5      32         2

Normalisation or scaling of a column in pyspark

I want to scale a particular column in PySpark. In this case I want to scale the results column. My data frame looks like:
id  age  results
 1   28       98
 2   27       12
 3   28       99
 4   28        5
 5   27       54
Here is what I have done so far:
df = spark.createDataFrame(
    [(1, 28, 98), (2, 27, 12), (3, 28, 99), (4, 28, 5), (5, 27, 54)],
    ("id", "age", "results"))

minmax_result = df.groupBy("id").agg(
    min("results").alias("min_results"),
    max("results").alias("max_results"))

final_df = minmax_result.join(df, ["id"]).select(
    ((col("results") - col("min_results")) / col("min_results")).alias("scaled_results"))

final_df.show()
It gives me:
id  age  results  scaled_results
 1   28       98            null
 2   27       12            null
 3   28       99            null
 4   28        5            null
 5   27       54            null
I'm assuming you're planning to scale the column across all ids, so you won't be needing the groupby operation, unless you're going the UDF route. I'd suggest going with the following:
from pyspark.sql.functions import col

# avoid shadowing the built-in min/max
min_results = df.agg({"results": "min"}).collect()[0][0]
max_results = df.agg({"results": "max"}).collect()[0][0]

df_scaled = df.withColumn('scaled_results', (col('results') - min_results) / max_results)
I presume you're dividing each cell by the min value instead of the max value by mistake, but that might be the use case as well.
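If classic min-max normalization to [0, 1] is what's actually wanted, the denominator would be the range instead; a minimal sketch under that assumption, reusing min_results and max_results from above:
# hypothetical variant: true min-max scaling, yielding values in [0, 1]
df_scaled = df.withColumn('scaled_results',
                          (col('results') - min_results) / (max_results - min_results))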
You can use the StandardScaler function from PySpark ML, something like this:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
withStd=True, withMean=False)
scalerModel = scaler.fit(new_df)
scaledData = scalerModel.transform(new_df)
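Note that StandardScaler expects its inputCol to be a vector column, so a plain numeric column has to be assembled first; a minimal sketch (new_df and the features column are built here for illustration):
from pyspark.ml.feature import VectorAssembler

# pack the numeric "results" column into the vector column StandardScaler expects
assembler = VectorAssembler(inputCols=["results"], outputCol="features")
new_df = assembler.transform(df)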
Refer : https://spark.apache.org/docs/latest/mllib-feature-extraction.html

PowerPivot formula for row wise weighted average

I have a table in PowerPivot which contains the logged data of a traffic control camera mounted on a road. The table holds the velocity and the number of vehicles that passed the camera during specific time windows (e.g. 14:10 - 15:25). How can I get the average velocity of cars for a specific hour and list the results in a separate table with 24 rows (hours 0-23), where the second column of each row is the weighted average velocity for that hour? A sample of my stat_table data is given below:
count vel hour
----- --- ----
133 96.00237 15
117 91.45705 21
81 81.90521 6
2 84.29946 21
4 77.7841 18
1 140.8766 17
2 56.14951 14
6 71.72839 13
4 64.14309 9
1 60.949 17
1 77.00728 21
133 100.3956 6
109 100.8567 15
54 86.6369 9
1 83.96901 17
10 114.6556 21
6 85.39127 18
1 76.77993 15
3 113.3561 2
3 94.48055 2
In a separate PowerPivot table I have 24 rows and 2 columns, but when I enter my formula, every row gets updated with the same number. My formula is:
=sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count] * stat_table[vel])/sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count])
Create a new calculated column named "WeightedVelocity" as follows:
WeightedVelocity = [count] * [vel]
Then create a measure "WeightedAverage" as follows:
WeightedAverage = SUM(stat_table[WeightedVelocity]) / SUM(stat_table[count])
Use the measure "WeightedAverage" in the VALUES area of the pivot table and the "hour" column in ROWS to get the desired result.
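For a quick sanity check of the arithmetic, the same per-hour weighted average can be computed in pandas (a sketch; the DataFrame below holds just the first three sample rows):
import pandas as pd

stat_table = pd.DataFrame({'count': [133, 117, 81],
                           'vel':   [96.00237, 91.45705, 81.90521],
                           'hour':  [15, 21, 6]})

# sum(count * vel) / sum(count) within each hour, mirroring the DAX measure
g = (stat_table.assign(wv=stat_table['count'] * stat_table['vel'])
               .groupby('hour')[['wv', 'count']].sum())
weighted_avg = g['wv'] / g['count']
print(weighted_avg)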

SAS Proc Optmodel Constraint Syntax

I have an optimization exercise I am trying to work through and am stuck again on the syntax. Below is my attempt, and I'd really like a thorough explanation of the syntax in addition to the solution code. I think it's the specific index piece that I am having trouble with.
The problem:
I have an item that I wish to sell out of within ten weeks. I have a historical trend and wish to alter that trend by lowering price. I want maximum margin dollars. The below works, but I wish to add two constraints and can't sort out the syntax. I have spaces for these two constraints in the code, with my brief explanation of what I think they may look like. Here is a more detailed explanation of what I need each constraint to do.
inv_cap=There is only so much inventory available at each location. I wish to sell it all. For location 1 it is 800, location 2 it is 1200. The sum of the column FRC_UNITS should equal this amount, but cannot exceed it.
price_down_or_same=The price cannot bounce around, so each week's price must be less than or equal to the previous week's. That is, price(i) <= price(i-1), where i = week.
Here is my attempt. Thank you in advance for assistance.
*read in data;
data opt_test_mkdown_raw;
input
ITM_NBR
ITM_DES_TXT $
LCT_NBR
WEEK
LY_UNITS
ELAST
COST
PRICE
TOTAL_INV;
cards;
1 stuff 1 1 300 1.2 6 10 800
1 stuff 1 2 150 1.2 6 10 800
1 stuff 1 3 100 1.2 6 10 800
1 stuff 1 4 60 1.2 6 10 800
1 stuff 1 5 40 1.2 6 10 800
1 stuff 1 6 20 1.2 6 10 800
1 stuff 1 7 10 1.2 6 10 800
1 stuff 1 8 10 1.2 6 10 800
1 stuff 1 9 5 1.2 6 10 800
1 stuff 1 10 1 1.2 6 10 800
1 stuff 2 1 400 1.1 6 9 1200
1 stuff 2 2 200 1.1 6 9 1200
1 stuff 2 3 100 1.1 6 9 1200
1 stuff 2 4 100 1.1 6 9 1200
1 stuff 2 5 100 1.1 6 9 1200
1 stuff 2 6 50 1.1 6 9 1200
1 stuff 2 7 20 1.1 6 9 1200
1 stuff 2 8 20 1.1 6 9 1200
1 stuff 2 9 5 1.1 6 9 1200
1 stuff 2 10 3 1.1 6 9 1200
;
run;
data opt_test_mkdown_raw;
set opt_test_mkdown_raw;
ITM_LCT_WK=cats(ITM_NBR, LCT_NBR, WEEK);
ITM_LCT=cats(ITM_NBR, LCT_NBR);
run;
proc optmodel;
*set variables and inputs;
set<string> ITM_LCT_WK;
number ITM_NBR{ITM_LCT_WK};
string ITM_DES_TXT{ITM_LCT_WK};
string ITM_LCT{ITM_LCT_WK};
number LCT_NBR{ITM_LCT_WK};
number WEEK{ITM_LCT_WK};
number LY_UNITS{ITM_LCT_WK};
number ELAST{ITM_LCT_WK};
number COST{ITM_LCT_WK};
number PRICE{ITM_LCT_WK};
number TOTAL_INV{ITM_LCT_WK};
*read data into procedure;
read data opt_test_mkdown_raw into
ITM_LCT_WK=[ITM_LCT_WK]
ITM_NBR
ITM_DES_TXT
ITM_LCT
LCT_NBR
WEEK
LY_UNITS
ELAST
COST
PRICE
TOTAL_INV;
var NEW_PRICE{i in ITM_LCT_WK};
impvar FRC_UNITS{i in ITM_LCT_WK}=(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i])*LY_UNITS[i];
con ceiling_price {i in ITM_LCT_WK}: NEW_PRICE[i]<=PRICE[i];
/*con inv_cap {j in ITM_LCT}: sum{i in ITM_LCT_WK}=I want this to be 800 for location 1 and 1200 for location 2;*/
con supply_last {i in ITM_LCT_WK}: FRC_UNITS[i]>=LY_UNITS[i];
/*con price_down_or_same {j in ITM_LCT} : NEW_PRICE[week]<=NEW_PRICE[week-1];*/
*state function to optimize;
max margin=sum{i in ITM_LCT_WK}
(NEW_PRICE[i]-COST[i])*(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i])*LY_UNITS[i];
/*expand;*/
solve;
*write output dataset;
create data results_MKD_maxmargin
from
[ITM_LCT_WK]={ITM_LCT_WK}
ITM_NBR
ITM_DES_TXT
LCT_NBR
WEEK
LY_UNITS
FRC_UNITS
ELAST
COST
PRICE
NEW_PRICE
TOTAL_INV;
*write results to window;
print
/*NEW_PRICE */
margin;
quit;
The main difficulty is that in your application, decisions are indexed by (Item,Location) pairs and Weeks, but in your code you have merged (Item,Location,Week) triplets. I rather like that use of the data step, but the result in this example is that your code is unable to refer to specific weeks and to specific pairs.
The fix that changes your code the least is to add these relationships by using defined sets and inputs that OPTMODEL can compute for you. Then you will know which triplets refer to each combination of (Item,Location) pair and week:
/* This code creates a set version of the Item x Location pairs
that you already have as strings */
set ITM_LCTS = setof{ilw in ITM_LCT_WK} itm_lct[ilw];
/* For each Item x Location pair, define a set of which
Item x Location x Week entries refer to that Item x Location */
set ILWperIL{il in ITM_LCTS} = {ilw in ITM_LCT_WK: itm_lct[ilw] = il};
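If it helps to see the indexing idea outside OPTMODEL, ILWperIL is essentially a dict of lists keyed by the (Item, Location) string; a rough Python sketch of the same grouping (the keys here are made up for illustration):
# hypothetical keys: ITM_LCT_WK strings mapped to their ITM_LCT prefix
itm_lct = {'111': '11', '112': '11', '121': '12', '122': '12'}

ilw_per_il = {}
for ilw, il in itm_lct.items():
    ilw_per_il.setdefault(il, []).append(ilw)

print(ilw_per_il)  # {'11': ['111', '112'], '12': ['121', '122']}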
With this relationship you can add the other two constraints.
I left your code as is, but applied to the new code a convention I find useful, especially when there are similar names like itm_lct and ITM_LCTS:
sets are all caps;
input parameters start with lowercase;
output (vars, impvars, and constraints) starts with uppercase.
Here is the new OPTMODEL code:
proc optmodel;
*set variables and inputs;
set<string> ITM_LCT_WK;
number ITM_NBR{ITM_LCT_WK};
string ITM_DES_TXT{ITM_LCT_WK};
string ITM_LCT{ITM_LCT_WK};
number LCT_NBR{ITM_LCT_WK};
number WEEK{ITM_LCT_WK};
number LY_UNITS{ITM_LCT_WK};
number ELAST{ITM_LCT_WK};
number COST{ITM_LCT_WK};
number PRICE{ITM_LCT_WK};
number TOTAL_INV{ITM_LCT_WK};
*read data into procedure;
read data opt_test_mkdown_raw into
ITM_LCT_WK=[ITM_LCT_WK]
ITM_NBR
ITM_DES_TXT
ITM_LCT
LCT_NBR
WEEK
LY_UNITS
ELAST
COST
PRICE
TOTAL_INV;
var NEW_PRICE{i in ITM_LCT_WK} <= price[i];
impvar FRC_UNITS{i in ITM_LCT_WK} =
(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i]) * LY_UNITS[i];
/* Moved to the bound on NEW_PRICE above:
con ceiling_price {i in ITM_LCT_WK}: NEW_PRICE[i] <= PRICE[i]; */
con supply_last{i in ITM_LCT_WK}: FRC_UNITS[i] >= LY_UNITS[i];
/* This code creates a set version of the Item x Location pairs
that you already have as strings */
set ITM_LCTS = setof{ilw in ITM_LCT_WK} itm_lct[ilw];
/* For each Item x Location pair, define a set of which
Item x Location x Week entries refer to that Item x Location */
set ILWperIL{il in ITM_LCTS} = {ilw in ITM_LCT_WK: itm_lct[ilw] = il};
/* I assume that for each item and location
the inventory is the same for all weeks for convenience,
i.e., that is not a coincidence */
num inventory{il in ITM_LCTS} = max{ilw in ILWperIL[il]} total_inv[ilw];
con inv_cap {il in ITM_LCTS}:
sum{ilw in ILWperIL[il]} Frc_Units[ilw] = inventory[il];
num lastWeek = max{ilw in ITM_LCT_WK} week[ilw];
/* Concatenating indexes is not the prettiest, but gets the job done here*/
con Price_down_or_same {il in ITM_LCTS, w in 2 .. lastWeek}:
    New_Price[il || w] <= New_Price[il || w - 1];
*state function to optimize;
max margin=sum{i in ITM_LCT_WK}
(NEW_PRICE[i]-COST[i])*(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i])*LY_UNITS[i];
expand;
solve;
*write output dataset;
create data results_MKD_maxmargin
from
[ITM_LCT_WK]={ITM_LCT_WK}
ITM_NBR
ITM_DES_TXT
LCT_NBR
WEEK
LY_UNITS
FRC_UNITS
ELAST
COST
PRICE
NEW_PRICE
TOTAL_INV;
*write results to window;
print
NEW_PRICE FRC_UNITS
margin
;
quit;

Get quantile by groups in SAS

I have this sort of table:
Cluster  Age   FR
      8   70  153
...
What I want is a table that gives, for each Cluster and each Age, the mean of FR within each decile (10th quantile). It should look like:
Cluster  Age  Quantile   FR
      1    1      10%    12
      1    1      20%    14
      1    1      30%    16
      1    1      40%    18
      1    1      50%    20
      1    1      60%    22
      1    1      70%    24
      1    1      80%    26
      1    1      90%    28
      1    1     100%    30
      1    2      10%    13
      1    2      20%    15
      1    2      30%    17
I tried doing this with proc univariate but with no success...
proc univariate data=etude.Presta_cluster_panier noprint;
var FR;
output out=pctls pctlpre=P_ pctlpts=0 to 100 by 10;
run;
This can be accomplished in two steps with proc rank and proc means (note that proc rank with a by statement requires the data to be sorted by Cluster and Age first).
proc rank data=etude.Presta_cluster_panier out=outranks groups=10;
    var FR;
    ranks Quantile;
    by Cluster Age;
run;

proc means data=outranks;
    var FR;
    ways 3;
    class Cluster Age Quantile;
    output out=outmean;
run;
You will need to first obtain your quantiles by cluster and age, then re-merge with your master dataset, assign groups depending on the quantiles, and finally compute the mean by cluster, age, and quantile. It is not possible in one step.
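For comparison, the same group-then-average logic is compact in pandas (a sketch, assuming a DataFrame df with columns Cluster, Age, and FR):
import pandas as pd

# assign each FR value to a decile (0-9) within its Cluster x Age group
df['Quantile'] = (df.groupby(['Cluster', 'Age'])['FR']
                    .transform(lambda s: pd.qcut(s, 10, labels=False, duplicates='drop')))

# mean FR per Cluster x Age x decile
result = df.groupby(['Cluster', 'Age', 'Quantile'])['FR'].mean().reset_index()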