Diabetes: Prediction using discrimination on PCA - sql

I have a table:
col1 col2
2 20
2.5 25
2.67 30
2.99 40
I'm looking to get
varone = 2 x col2, vartwo = 2.5 x col2, varthree = 2.67 x col2, varfour = 2.99 x col2
i.e. extracting a specific value from a table
and then multiplying an entire column by that value (scalar x vector).
I tried transposing col1
col1a col1b col1c col1d col2
2 2.5 2.67 2.99 20
25
30
40
and then tried multiplying col1a x col2, but it didn't seem to work.

In SAS, you can just use proc sql:
proc sql;
select 2*col2 as varone, 2.5*col2 as vartwo, 2.67*col2 as varthree, 2.99*col2 as varfour
from atable;
quit;

Assuming you're using SAS and either PROC FACTOR or PROC PRINCOMP then you can use PROC SCORE.
Example straight from the documentation:
http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_score_sect017.htm
/* This data set contains only the first 12 observations */
/* from the full data set used in the chapter on PROC REG. */
data Fitness;
input Age Weight Oxygen RunTime RestPulse RunPulse @@;
datalines;
44 89.47 44.609 11.37 62 178 40 75.07 45.313 10.07 62 185
44 85.84 54.297 8.65 45 156 42 68.15 59.571 8.17 40 166
38 89.02 49.874 9.22 55 178 47 77.45 44.811 11.63 58 176
40 75.98 45.681 11.95 70 176 43 81.19 49.091 10.85 64 162
44 81.42 39.442 13.08 63 174 38 81.87 60.055 8.63 48 170
44 73.03 50.541 10.13 45 168 45 87.66 37.388 14.03 56 186
;
proc factor data=Fitness outstat=FactOut
method=prin rotate=varimax score;
var Age Weight RunTime RunPulse RestPulse;
title 'Factor Scoring Example';
run;
proc print data=FactOut;
title2 'Data Set from PROC FACTOR';
run;
proc score data=Fitness score=FactOut out=FScore;
var Age Weight RunTime RunPulse RestPulse;
run;
proc print data=FScore;
title2 'Data Set from PROC SCORE';
run;

You can use an array to achieve this.
The program below is dynamic: it works for any number of observations.
****data we have****;
data have;
input col1 col2;
datalines;
2 20
2.5 25
2.67 30
2.99 40
;
run;
****Taking Count****;
****Creating macro "value" to store col1 data****;
proc sql ;
select count(*) into :cnt_rec from have;
select col1 into :value1 - :value&SysMaxLong from have;
quit;
data want(drop=i);
set have;
array NewColumn(&cnt_rec);
****processing the array and multiplying col2 data****;
do i = 1 to &cnt_rec;
NewColumn[i] = symget('value'||left(i)) * col2;
end;
run;
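For readers who want to check the arithmetic outside SAS, here is a minimal Python sketch of the same scalar x vector idea. It is not a translation of the SAS program above; the list `rows` is a hypothetical in-memory copy of the example table, and the variable names are taken from the question.

```python
# Hypothetical in-memory copy of the example table (col1, col2).
rows = [(2, 20), (2.5, 25), (2.67, 30), (2.99, 40)]

scalars = [col1 for col1, _ in rows]  # the values pulled out of col1
col2 = [c2 for _, c2 in rows]         # the column to be scaled

# One new column per scalar: new_columns[k][r] == scalars[k] * col2[r]
new_columns = [[s * v for v in col2] for s in scalars]

names = ("varone", "vartwo", "varthree", "varfour")
for name, column in zip(names, new_columns):
    print(name, column)
```

Each scalar produces a full column, so four input rows give four new columns of four values each, which matches the shape of the array built in the data step.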

Related

Segregate dataset based on certain matching variables

I have two datasets: one is the base dataset and the other is a subset of it. I want to create a dataset containing the records that are present in the base dataset but not in the subset. That is, if the combination of acct_num, test_id, tran_date and actual_amt is not present in the subset, the record should appear in the resulting dataset.
DATA base;
INPUT acct_num test_id tran_date:anydtdte. actual_amt final_amt final_amt_added ;
format tran_date date9.;
DATALINES;
55203610 2542 12-jan-20 30 45 45
16124130 8062 . 56 78 78
16124130 8062 14-dec-19 8 78 78
80479512 2062 19-mar-19 32 32 32
70321918 2062 20-dec-19 1 93 54
17312410 6712 . 45 90 90
17312410 6712 15-jun-18 0 90 90
74623123 2092 17-aug-18 34 87 87
24245321 2082 22-jan-17 22 56 67
;
run;
data subset;
input acct_num test_id tran_date:anydtdte. actual_amt final_amt final_amt_added ;
format tran_date date9.;
DATALINES;
55203610 2542 12-jan-20 30 45 45
16124130 8062 . 56 78 78
16124130 8062 14-dec-19 8 78 78
17312410 6712 . 45 90 90
74623123 2092 17-aug-18 34 87 87
24245321 2082 22-jan-17 22 56 67
;
run;
Data that I want:
80479512 2062 19-mar-19 32 32 32
70321918 2062 20-dec-19 1 93 54
17312410 6712 15-jun-18 0 90 90
I have tried using NOT IN in SQL, but it cannot match on multiple variables in a single condition.
Any help will be appreciated.
This is a set difference (minus) problem; see the EXCEPT operator:
proc sql noprint;
create table want as
select * from base
except
select * from subset
;
quit;
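The behaviour of EXCEPT can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for PROC SQL, which is an assumption: the two engines are different, although both compare whole rows here. The tables hold only a few of the example rows, with dates stored as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
ddl = "(acct_num INTEGER, test_id INTEGER, tran_date TEXT, actual_amt INTEGER)"
conn.execute("CREATE TABLE base " + ddl)
conn.execute("CREATE TABLE subset " + ddl)

# A few rows from the example; None stands for a missing tran_date.
base_rows = [
    (55203610, 2542, "2020-01-12", 30),
    (16124130, 8062, None, 56),
    (80479512, 2062, "2019-03-19", 32),
    (17312410, 6712, None, 45),
    (17312410, 6712, "2018-06-15", 0),
]
subset_rows = [
    (55203610, 2542, "2020-01-12", 30),
    (16124130, 8062, None, 56),
    (17312410, 6712, None, 45),
]
conn.executemany("INSERT INTO base VALUES (?, ?, ?, ?)", base_rows)
conn.executemany("INSERT INTO subset VALUES (?, ?, ?, ?)", subset_rows)

# EXCEPT keeps the rows of the first query that do not appear in the second.
only_in_base = conn.execute(
    "SELECT * FROM base EXCEPT SELECT * FROM subset ORDER BY acct_num"
).fetchall()
print(only_in_base)
```

Note that EXCEPT compares every selected column, so with `select *` a row counts as present in the subset only if all of its columns match, not just the four key variables.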
Make a list of all the observed values in subset, then simply merge the base file with the combinations found in subset and output the records that are in base only.
Note that it is important to restrict subset_combinations to non-duplicates and to keep only the sorting variables; otherwise the merge may overwrite values in base with values from subset.
proc sort data=base;
by acct_num test_id tran_date actual_amt;
proc sort data=subset out=subset_combinations (keep=acct_num test_id tran_date actual_amt) nodupkey;
by acct_num test_id tran_date actual_amt;
data want;
merge base (in=in1) subset_combinations (in=in2);
by acct_num test_id tran_date actual_amt;
if in1 & ^in2;
run;
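The merge above amounts to an anti-join on the four key variables. A rough Python sketch of that logic, assuming the rows are held as dictionaries (the helper `key_of` and the shortened row lists are illustrative, not part of the original answer):

```python
KEY = ("acct_num", "test_id", "tran_date", "actual_amt")

# Shortened, hypothetical copies of the example data.
base = [
    {"acct_num": 16124130, "test_id": 8062, "tran_date": None,
     "actual_amt": 56, "final_amt": 78},
    {"acct_num": 80479512, "test_id": 2062, "tran_date": "19-mar-19",
     "actual_amt": 32, "final_amt": 32},
    {"acct_num": 70321918, "test_id": 2062, "tran_date": "20-dec-19",
     "actual_amt": 1, "final_amt": 93},
    {"acct_num": 74623123, "test_id": 2092, "tran_date": "17-aug-18",
     "actual_amt": 34, "final_amt": 87},
]
subset = [base[0], base[3]]


def key_of(row):
    """Return the tuple of key values that identifies a row."""
    return tuple(row[name] for name in KEY)


# Distinct key combinations seen in subset (the role of subset_combinations).
subset_keys = {key_of(row) for row in subset}

# Keep the base rows whose key combination never occurs in subset.
want = [row for row in base if key_of(row) not in subset_keys]
print([row["acct_num"] for row in want])
```

Unlike EXCEPT over all columns, this compares only the key variables, so differences in final_amt or final_amt_added do not affect the result.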

Conditional Subtraction within the same column using SAS

here is a portion of a data set I have named “antibody” :
Row Subject Type Procedure Measurement Output
1 500 Initial Invasive 20 20
2 500 Initial Surface 35 35
3 500 Followup Invasive 54 54-20
4 428 Followup Outer 29 29-10
5 765 Seventh Other 13 13-19
6 500 Followup Surface 98 98-35
7 428 Initial Outer 10 10
8 765 Initial Other 19 19
9 610 Third Invasive 66 66-17
10 610 Initial Invasive 17 17
I was trying to use proc sql to perform this. The goal is to subtract the numbers in the "MEASUREMENT" column based on the "SUBJECT", "TYPE" and “PROCEDURE” columns. If two values in the “SUBJECT” column match and two values in the “PROCEDURE” column match, then the initial measurement should be subtracted from the other measurement. For example, the initial measurement in row 1 (20) should be subtracted from the followup measurement in row 3 (54) because the subject (500) and procedure (Invasive) match. Furthermore, the initial measurement in row 8 (19) should be subtracted from the seventh measurement in row 5 (13) because the subject (765) and procedure (Other) match. The result should form the "OUTPUT" column.
Thank you in advance!
Here is a hash object approach
data have;
input Subject Type $ Procedure $ Measurement;
datalines;
500 Initial Invasive 20
500 Initial Surface 35
500 Followup Invasive 54
428 Followup Outer 29
765 Seventh Other 13
500 Followup Surface 98
428 Initial Outer 10
765 Initial Other 19
610 Third Invasive 66
610 Initial Invasive 17
;
data want (drop=rc _Measurement);
if _N_ = 1 then do;
declare hash h (dataset : "have (rename=(Measurement=_Measurement) where=(Type='Initial'))");
h.definekey ('Subject', 'Procedure');
h.definedata ('_Measurement');
h.definedone();
end;
set have;
_Measurement=.;
if Type ne 'Initial' then rc = h.find();
Measurement = sum (Measurement, -_Measurement);
run;
EDIT:
data want (drop=rc _Measurement);
if _N_ = 1 then do;
declare hash h (dataset : "have (rename=(Measurement=_Measurement) where=(Type='Initial'))");
h.definekey ('Subject', 'Procedure');
h.definedata ('_Measurement');
h.definedone();
end;
set have;
_Measurement=.;
if Type ne 'Initial' then rc = h.find();
NewMeasurement = ifn(Measurement=., ., sum (Measurement, -_Measurement));
run;
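The lookup idea behind the hash object can be sketched in Python with a dictionary. This is an illustration under one assumption taken from the question: the baseline is matched on both subject and procedure. The name `baseline` is hypothetical.

```python
# (subject, type, procedure, measurement) rows from the example.
rows = [
    (500, "Initial", "Invasive", 20),
    (500, "Initial", "Surface", 35),
    (500, "Followup", "Invasive", 54),
    (428, "Followup", "Outer", 29),
    (765, "Seventh", "Other", 13),
    (500, "Followup", "Surface", 98),
    (428, "Initial", "Outer", 10),
    (765, "Initial", "Other", 19),
    (610, "Third", "Invasive", 66),
    (610, "Initial", "Invasive", 17),
]

# Lookup table of initial measurements, keyed by (subject, procedure).
baseline = {(s, p): m for s, t, p, m in rows if t == "Initial"}

output = []
for subject, visit_type, procedure, measurement in rows:
    initial = baseline.get((subject, procedure))
    if visit_type == "Initial" or initial is None:
        # Initial rows, and rows with no matching baseline, stay unchanged.
        output.append(measurement)
    else:
        output.append(measurement - initial)

print(output)
```

Building the lookup table first means the order of the rows does not matter, which is why the followup in row 4 can use the initial value that only appears in row 7.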

how to concat corresponding rows value to make column name in pandas?

I have the messy dataframe below. I need to combine rows 0 and 1 to form the column names and keep the remaining rows as is:
Start Date 2005-01-01 Unnamed: 3 Unnamed: 4 Unnamed: 5
Dat an_1 an_2 an_3 an_4 an_5
mt mt s t inch km
23 45 67 78 89 9000
change to below dataframe :
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
23 45 67 78 89 9000
IIUC
df.columns=df.loc[0]+'_'+df.loc[1]
df=df.loc[[2]]
df
Out[429]:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
2 23 45 67 78 89 9000
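The same header-joining step can be illustrated without pandas. The sketch below uses plain lists that mirror the example and is only meant to show the element-wise join that `df.loc[0]+'_'+df.loc[1]` performs; the names `row0`, `row1` and `data` are assumptions.

```python
# The two header rows and the single data row from the example.
row0 = ["Dat", "an_1", "an_2", "an_3", "an_4", "an_5"]
row1 = ["mt", "mt", "s", "t", "inch", "km"]
data = [[23, 45, 67, 78, 89, 9000]]

# Join the two header rows element by element with an underscore.
columns = [f"{name}_{unit}" for name, unit in zip(row0, row1)]

# Attach the new column names to the remaining data rows.
records = [dict(zip(columns, row)) for row in data]

print(columns)
print(records)
```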

How do I show a new column with values in sql select query?

There is no primary key in the table.
My table is like
col1 col2 col3
12 34 35
56 34 35
13 56 35
56 34 35
12 56 34
I want a query which displays an extra column like this :
col0 col1 col2 col3
rox 12 34 35
max 56 34 35
bill 13 56 35
rox 56 34 35
sam 12 56 34
To add a new column
alter table your_table
add col0 varchar2(10);
"There is no primary key in the table."
So that means the relationship between the value in col0 and the existing columns is essentially random, which gives us a free hand in populating the new column.
Here is the test data:
SQL> select * from your_table;
COL0 COL1 COL2 COL3
---------- ---------- ---------- ----------
12 34 35
56 34 35
13 56 35
56 34 35
12 56 34
SQL>
Here is a PL/SQL routine which uses bulk operations to update all the rows in the table.
SQL> declare
2 -- this collection type will be available in all Oracle databases
3 new_col sys.dbms_debug_vc2coll :=
4 sys.dbms_debug_vc2coll('rox','max','bil','rox','sam');
5 your_table_rowids sys.dbms_debug_vc2coll;
6 begin
7 -- note: solution fits given data, so does not bother to handle large volumes
8 select rowidtochar(yt.rowid)
9 bulk collect into your_table_rowids
10 from your_table yt;
11
12 -- NB: this is not a loop, this is a bulk operation
13 forall idx in your_table_rowids.first()..your_table_rowids.last()
14 update your_table yt
15 set yt.col0 = new_col(idx)
16 where yt.rowid = chartorowid(your_table_rowids(idx))
17 ;
18 end;
19 /
PL/SQL procedure successfully completed.
SQL>
The nonsense with ROWID is necessary because your table has no primary key and duplicate rows.
Anyway, here's the outcome:
SQL> select * from your_table
2 /
COL0 COL1 COL2 COL3
---------- ---------- ---------- ----------
rox 12 34 35
max 56 34 35
bil 13 56 35
rox 56 34 35
sam 12 56 34
SQL>
Having written an answer which uses DDL and UPDATE to change the database state I read the question title properly 8-) So here is a query which selects a pseudo-column COL0.
With this test data:
SQL> select * from your_table
2 /
COL1 COL2 COL3
---------- ---------- ----------
12 34 35
56 34 35
13 56 35
56 34 35
12 56 34
SQL> with nt as (
2 select column_value as col0
3 , rownum as rn
4 from
5 table(sys.dbms_debug_vc2coll('rox','max','bil','rox','sam'))
6 )
7 select nt.col0
8 , yt.col1
9 , yt.col2
10 , yt.col3
11 from nt
12 join ( select t.*
13 , rownum as rn
14 from your_table t) yt
15 on yt.rn = nt.rn
16 /
COL0 COL1 COL2 COL3
---------- ---------- ---------- ----------
rox 12 34 35
max 56 34 35
bil 13 56 35
rox 56 34 35
sam 12 56 34
SQL>
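The query pairs labels with rows purely by position. A small sketch of that pairing, using Python's built-in sqlite3 module rather than Oracle, so `ORDER BY rowid` is an SQLite-specific stand-in for ROWNUM and the whole thing should be read as an assumption-laden illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (col1 INTEGER, col2 INTEGER, col3 INTEGER)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [(12, 34, 35), (56, 34, 35), (13, 56, 35), (56, 34, 35), (12, 56, 34)],
)

# Labels to attach, in the order they should line up with the rows.
labels = ["rox", "max", "bil", "rox", "sam"]

# SQLite's implicit rowid reflects insertion order for this fresh table.
rows = conn.execute(
    "SELECT col1, col2, col3 FROM your_table ORDER BY rowid"
).fetchall()

# Pair the n-th label with the n-th row, like the join on rn.
labelled = [(label,) + row for label, row in zip(labels, rows)]
for record in labelled:
    print(record)
```

Because the table has no key, the pairing depends entirely on the order in which rows come back, so the result is only as stable as that ordering.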

How do I merge by more than one variable using proc SQL in SAS

I have 2 datasets in SAS:
main_1
ID Rep Dose Response
1 2 34 567
1 1 45 756
2 1 35 456
3 1 56 345
main_2
ID Rep Hour Day
1 1 89 157
2 1 62 365
3 1 12 689
I can easily merge these 2 datasets first by ID and then by Rep (as one of the ID's has two observations) with the following code in SAS:
proc import out=main_1
datafile='/folders/myfolders/sasuser.v94/main_1.xls'
dbms=xls replace;
/*optional*/
sheet='Sheet1';
getnames=yes;
run;
proc import out=main_2
datafile='/folders/myfolders/sasuser.v94/main_2.xls'
dbms=xls replace;
/*optional*/
sheet='Sheet1';
getnames=yes;
run;
/*merge datasets based on common variable (ID then Rep)*/
/*first sort all datasets by target variables*/
proc sort data=main_1;
by ID Rep;
proc sort data=main_2;
by ID Rep;
run;
/*can now be merged*/
data main_merge;
merge main_1 main_2;
by ID Rep;
run;
this produces the following table:
ID Rep Dose Response Hour Day
1 1 45 756 89 157
1 2 34 567 . .
2 1 35 456 62 365
3 1 56 345 12 689
I currently have the following proc SQL alternative (I am learning, so sorry if it's terrible) but cannot seem to merge by more than 1 variable (i.e. ID and Rep):
proc sql;
create table merged_sql as
select L.*, R.*
from main_1 as L
LEFT JOIN main_2 as R
on L.ID = R.ID;
quit;
producing the following:
ID Rep Dose Response Hour Day
1 2 34 567 89 157
1 1 45 756 89 157
2 1 35 456 62 365
3 1 56 345 12 689
Any suggestion on a proc SQL code to achieve the same table as previously? My current code adds the '89 157' to both ID=1 observations.
Many thanks.
You're almost there...
proc sql;
create table merged_sql as
select L.*,
R.HOUR,
R.DAY
from main_1 as L
LEFT JOIN main_2 as R
on L.ID = R.ID
and L.REP = R.REP;
quit;
The reason not to use R.* is to avoid a note or warning about having duplicate ID and REP fields.
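To see the effect of the extra join condition, here is a small sketch using Python's built-in sqlite3 module in place of PROC SQL (an assumption; the join syntax is the same, but the engines differ). The tables hold the example data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_1 (ID INTEGER, Rep INTEGER, Dose INTEGER, Response INTEGER)")
conn.execute("CREATE TABLE main_2 (ID INTEGER, Rep INTEGER, Hour INTEGER, Day INTEGER)")
conn.executemany(
    "INSERT INTO main_1 VALUES (?, ?, ?, ?)",
    [(1, 2, 34, 567), (1, 1, 45, 756), (2, 1, 35, 456), (3, 1, 56, 345)],
)
conn.executemany(
    "INSERT INTO main_2 VALUES (?, ?, ?, ?)",
    [(1, 1, 89, 157), (2, 1, 62, 365), (3, 1, 12, 689)],
)

# Joining on both ID and Rep stops ID=1, Rep=2 from picking up Rep=1's values.
merged = conn.execute(
    """
    SELECT L.ID, L.Rep, L.Dose, L.Response, R.Hour, R.Day
    FROM main_1 AS L
    LEFT JOIN main_2 AS R
      ON L.ID = R.ID AND L.Rep = R.Rep
    ORDER BY L.ID, L.Rep
    """
).fetchall()
for row in merged:
    print(row)
```

The row for ID=1, Rep=2 has no partner in main_2, so its Hour and Day come back as None, the counterpart of the missing values (.) in the data step merge.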