Suppose my data set looks like this:
A B C
1 2 0.2
2 7 0.3
3 10 0.7
and I want to multiply columns A and B by C and update the values. What is the most efficient way to do this?
Maybe I'm misunderstanding, but this is quite basic. Then again, basics are the most important bit.
data begin;
input A B C;
cards;
1 2 0.2
2 7 0.3
3 10 0.7
;
run;
data wanted;
set begin;
AC=A*C;
BC=B*C;
run;
/* Here is an easy example.*/
/*Your first data set*/
data first;
input A B C;
datalines;
1 2 0.2
2 7 0.3
3 10 0.7
;
run;
/*The data you want to get*/
data product;
set first;
AC=A*C;
BC=B*C;
run;
Since you are just looking to update the values of A and B, try this:
data product;
set first;
A=A*C;
B=B*C;
run;
Alternatively, you could try:
proc sql noprint;
create table product as
select a*c as a, b*c as b, c
from first;
quit;
and then compare run times to see which one runs faster.
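If you want hard numbers rather than a feel from the log, the FULLSTIMER system option makes each step's log notes report detailed real and CPU time; a minimal sketch (the output dataset names here are just illustrative):
options fullstimer;
/* DATA step version */
data product_ds;
set first;
A = A*C;
B = B*C;
run;
/* PROC SQL version */
proc sql noprint;
create table product_sql as
select a*c as a, b*c as b, c
from first;
quit;
options nofullstimer;
On a three-row table any difference is noise, so run the comparison on data of realistic size.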
I need to fill the nulls of a column with the mean of the ratio of two columns. An example would be:
A B C ... B_01 C_01
5 . .
5 2 3
7 3 1.2
9 3 0.3
4 . .
I would like the missing values in column B to be (2/5 + 3/7 + 3/9) / 3 * the corresponding value of column A.
For column new_C: (3/5 + 1.2/7 + 0.3/9) / 3 * the corresponding value of column A.
I have thought about doing the following, but it turns out that I have 60 columns to handle this way, and the only approach that comes to my mind is to repeat this 60 times.
proc sql;
create table new as
select *
, sum(B/A) / sum(case when B is missing then . else 1 end) as new_B
, sum(C/A) / sum(case when C is missing then . else 1 end) as new_C_01
from table_one
;
quit;
Thanks
PROC SQL should be able to do that easily.
First let's convert your data listing into an actual dataset.
data have;
input A B C ;
cards;
5 . .
5 2 3
7 3 1.2
9 3 0.3
4 . .
;
Now let's use it to create a new version of B that follows your rules.
proc sql;
create table want as
select *,coalesce(b,a*mean(b/a)) as new_b
from have
;
quit;
Results:
OBS  A  B    C    new_b
  1  5  .    .    1.93651
  2  5  2   3.0   2.00000
  3  7  3   1.2   3.00000
  4  9  3   0.3   3.00000
  5  4  .    .    1.54921
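The same pattern extends to any number of columns; a sketch adding new_c under the same rule (column names follow the example above):
proc sql;
create table want as
select *
, coalesce(b, a*mean(b/a)) as new_b
, coalesce(c, a*mean(c/a)) as new_c
from have
;
quit;
SAS remerges the summary statistic back onto every row (and says so in a log note), which is exactly the behaviour we rely on here.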
You can use Proc MEANS to compute the mean fraction of each of the 60 columns, and apply the imputation rule in a DATA step.
Example:
data have;
call streaminit(20230129);
do row = 1 to 100;
a = rand('integer', 2, 30); /* at least 2, so that a-1 below is a valid upper bound */
array x x1-x60;
do over x;
x = ifn(rand('uniform') > 0.30, rand('integer', a-1), .);
end;
output;
end;
run;
data fractions;
set have;
array f x1-x60;
do over f;
if not missing(f) then f = f / a;
end;
rename x1-x60 = f1-f60;
run;
proc means noprint data=fractions;
output out=means mean(f1-f60)=mean1-mean60;
var f1-f60;
run;
data want;
set have;
one = 1;
set means point=one; /* attach the single row of means to every observation */
array means mean1-mean60;
array x x1-x60;
do over x;
if missing(x) then means = means * a; /* impute: mean fraction times this row's A */
else means = x; /* otherwise keep the observed value */
end;
rename mean1-mean60=new_x1-new_x60;
drop one _type_ _freq_; /* housekeeping: pointer and PROC MEANS bookkeeping variables */
run;
I have a data set
ID value1
1 12
2 345
3 342
I have a second data set
value2
3823
How do I get the following result?
ID value1 value2
1 12 3823
2 345 3823
3 342 3823
Any joins I have tried have given me:
ID value1 value2
1 12 .
2 345 .
3 342 .
. . 3823
No need for joins or helper variables:
data have;
do i = 1 to 3;
output;
end;
run;
data lookup;
j = 1;
run;
data want;
set have;
if _n_ = 1 then set lookup;
run;
Without the if _n_ = 1, the data step stops after one iteration when it tries to read a second row from the lookup dataset and finds that there are no rows remaining.
N.B. this requires that the have dataset doesn't already contain a variable with the same name as the variable(s) attached from the lookup dataset.
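Applied to the data sets in the question (call them first and second), the same trick attaches value2 to every row; variables read with SET are retained across iterations, so value2 carries forward automatically. A sketch:
data want;
set first;
if _n_ = 1 then set second; /* read the single value2 row once; it is retained */
run;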
By far the easiest way to do this is to use PROC SQL and define the join condition 1=1, which is true for every pair of rows:
data first;
input ID value1 @@;
cards;
1 12 2 345 3 342
;
run;
data second;
input value2 ;
cards;
3823
;
run;
proc sql;
create table wanted as
select * from first
left join second
on 1=1
;
quit;
Edit: As far as I know, there isn't a direct way to merge data sets row by row without a common variable, but you can use the following trick. Add a helper variable:
data second_trick;
set second;
help=1;
run;
data first_trick;
set first;
help=1;
run;
Then we just perform the merge by the static variable:
data wanted_trick;
merge first_trick(in=a) second_trick;
by help;
if a; /*Left join, just to be sure.*/
run;
Note that this only works if you want to add a single static value; don't try it if your second data set has more rows.
For more on merges and joins, see: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/249-30.pdf
In SAS, is there an easy way to extract records from a data set that have more than 2 occurrences?
The DUPS command gives duplicates, but how do I get triplicates and higher?
For example, in this dataset:
col1 col2 col3 col4 col5
1 2 3 4 5
1 2 3 5 7
1 2 3 4 8
A B C D E
A B C S W
The first 3 columns are my key columns, so in my output I only want the first 3 rows (triplicates) but not the last 2 rows (duplicates).
I would use proc sql for this, taking advantage of the group by and having clauses. Even though it's a single step of code, it does require 2 passes of the data in the background; however, I believe that will be the case whichever method you use.
data have;
input col1 $ col2 $ col3 $ col4 $ col5 $;
datalines;
1 2 3 4 5
1 2 3 5 7
1 2 3 4 8
A B C D E
A B C S W
;
run;
proc sql;
create table want as
select * from have
group by col1,col2,col3
having count(*)>2;
quit;
You can achieve this using proc sql pretty easily. The below example will keep all rows from the table that are triplicates (or higher).
Create some sample data:
data have;
input col1 $
col2 $
col3 $
col4 $
col5 $
;
datalines;
1 2 3 4 5
1 2 3 5 7
1 2 3 4 8
A B C D E
A B C S W
;
run;
First identify the triplicates. I'm assuming you want triplicates (or above), and that you're grouping on the first 3 columns:
proc sql noprint;
create table tmp as
select col1, col2, col3, count(*)
from have
group by 1,2,3
having count(*) ge 3
;
quit;
Then use the tmp table we just created to filter against the original dataset via a join.
proc sql noprint;
create table want as
select a.*
from have a
join tmp b on b.col1 = a.col1
and b.col2 = a.col2
and b.col3 = a.col3
;
quit;
These 2 steps could be combined into a single step with a subquery if you desired, but I'll leave that up to you.
EDIT: Keith's answer provides a shorthand way to combine these 2 steps into one.
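For reference, the combined form might look like this (a sketch using an in-line view; it simply nests the grouped query from the first step inside the join of the second):
proc sql noprint;
create table want as
select a.*
from have a
join (select col1, col2, col3
      from have
      group by col1, col2, col3
      having count(*) ge 3) b
on b.col1 = a.col1
and b.col2 = a.col2
and b.col3 = a.col3
;
quit;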
This is the data step solution: a double DoW loop, first counting the rows, then outputting when the count is >= 3. It actually only passes through the data once after the sort (a double DoW loop lists two reads, but they're buffered, so the file is only read once).
This solution runs slightly faster on my machine than the SQL solution; not by much, but slightly. On a dataset with 12.5MM rows, of which about 1/4 are in the final triplicate+ dataset, total time (sort plus data step) is about 10s real/13.5s CPU, while Keith's SQL is 12.5s real/15s CPU. I suspect this solution is always slightly faster than the SQL solution, though often not by much, unless the data are already in order but not marked as sorted: then the data step approach can skip the sort, while SQL cannot (it will only skip its sort if the dataset is marked as sorted).
The NOUNIQUEKEY option only works in 9.3+ and is sometimes helpful: if you have a lot of unique records and only a few duplicates (2+ rows per key), it will speed up the final read. It makes the sort take longer, though, so it's only worth it if that final read speed-up matters. In the example below it's very slightly helpful, but not by much; it shifts execution from 5s/5s real time (sort/data step) to 6s/3s, at about the same total CPU time: 8s/5.5s (normal) vs 10.5s/3s (NOUNIQUEKEY).
data have;
input col1 $ col2 $ col3 $ col4 $ col5 $;
datalines;
1 2 3 4 5
1 2 3 5 7
1 2 3 4 8
A B C D E
A B C S W
;;;;
run;
proc sort nouniquekey data=have out=have_dups; *nouniquekey is 9.3+, helps optimize sometimes- it is optional;
by col1 col2 col3;
run;
data want;
do _n_=1 by 1 until (last.col3);
set have_dups;
by col1 col2 col3;
count_row=sum(count_row,1);
end;
do _n_=1 by 1 until (last.col3);
set have_dups;
by col1 col2 col3;
if count_row ge 3 then output;
end;
run;
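Following up on the note about skipping the sort: if the input may already be in order but isn't flagged as sorted, PROC SORT's PRESORTED option (also 9.3+) verifies the order first and only performs the actual sort when the data are out of sequence. A sketch, assuming the same key columns:
proc sort presorted data=have out=have_dups;
by col1 col2 col3;
run;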
I have been dealing with this issue that I thought was trivial, but for some reason nothing I have tried has worked so far.
I have a dataset
obs A B C
1 2 6 7
2 3 1 5
3 8 5 9
. . . .
For each observation, I want to compare the value in column A to the values in columns B and C, and assign the value 1 to a variable called within. My goal is to select only observations whose A value lies within their B and C values. I have tried everything, but nothing seems to work.
Thank you.
Here's how to do it in a data step. Let me know if that works for you.
data new;
set old;
if B < A < C then within = 1; /* SAS evaluates the chained comparison as (B < A) and (A < C) */
else delete;
run;
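If you would rather keep every row and just flag the in-range ones (the question asks for a within indicator rather than a filter), a boolean expression does it in one statement; this sketch assumes B is always the lower bound:
data new;
set old;
within = (B < A < C); /* 1 when A lies strictly between B and C, else 0 */
run;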
I would like to create a table that lists the frequency of each variable's frequencies. For example, take a data set with 100 rows and 4 variables: ID, A, B, and C.
What I'm looking for would be like this:
Freqs | ID   A   B   C
----------------------
  1   | 100  20  15  10
  2   |   0  40  35   0
  3   |   0   0   5  30
Since there are 100 unique IDs, the ID column will show a frequency of 100 for counts of 1 in the original data.
Edit for clarification:
If you did a proc freq on the original data, you would get a frequency of 1 for every ID. Then if you did a proc freq on the count, you would have a frequency of 100 for counts of 1. I'm looking for that for every variable in a data set.
This should do what you want. You will probably want to post-process the preds table, since each value of its Table column starts with the word "Table", but this is a pretty simple way to do it.
ods output onewayfreqs=preds;
proc freq data=sashelp.class;
tables _all_;
run;
ods output close;
proc tabulate data=preds;
class table frequency;
tables frequency,table;
run;
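If you want cleaner column headers, a small data step can strip that "Table " prefix before tabulating (a sketch, assuming the default ONEWAYFREQS output structure, where the Table variable holds values like "Table Age"):
data preds;
set preds;
table = scan(table, -1, ' '); /* keep just the variable name, e.g. "Table Age" -> "Age" */
run;
proc tabulate data=preds;
class table frequency;
tables frequency, table;
run;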