I would like to Sum, for each Codinv and Class, values of previous years listed in column D.
Thank you Rigerta. Here is my New Request. Now that I think about that, when there is just one row per CodInv per Class, it should show the same value as D. Hence, I would like a new column to be calculated as follows
Codinv Class year D NewColumn
----------------------------------------------------------
13 C08F 1977 5 5
76 C01B 1999 1 1
76 C21D 2005 2 2
76 C23C 1998 2 2
76 C23C 1999 2 4
I would change the code as follows, but it still does not work
As I read online, I tried with
data Want;
set Have;
by Codinv Class year;
retain NewColumn;
if first.Class then NewColumn=D; output;
if last.year NewColumn=NewColumn+D;
run;
It worked well with another analysis I had to do where I sorted by Codinv and Year only, now that I am doing it with three I tried different variations, but it is showing missing data for all rows or 0... Can you help me out? Forever Grateful
You're close with your attempt, I've modified it to produce the desired output. A summary of the changes I've made are :
Removed the retain statement. The method I've used adopts an automatic retain, so isn't necessary.
Initialise NewColumn to 0 each time the class changes
Add D to NewColumn for each row. (x+y, as used here, creates an implied retain)
Removed the output statement. This is implied at the end of the data step, so isn't needed.
Removed the if last.year... line as it is not necessary
Strictly speaking, having year in the by statement isn't necessary, but it is useful to keep in to ensure the data is sorted properly.
data have;
input Codinv Class $ year D;
datalines;
13 C08F 1977 5
76 C01B 1999 1
76 C21D 2005 2
76 C23C 1998 2
76 C23C 1999 2
;
run;
data Want;
set Have;
by Codinv Class year;
if first.Class then NewColumn=0;
newcolumn+D;
run;
Related
Simplified version of the dataset I have is:
DATA HAVE;
INPUT ID match1 $ match2 $ not_relevant;
DATALINES;
1 "ABC" "ABC" 4
1 "XYZ" "XYZ" 29
2 "QQQ" "AAA" 5
2 "ABC" "ABC" 9
3 "EFG" "EFG" 7
3 "DEF" "DEF" 12
3 "LMK" LMK" 16
3 "LMK" . 29
;RUN;
I am looking to compare match1 and match2, and if anywhere in the ID column match1 does not equal match2, I would like to remove all of the rows with that ID. So for this example dataset I want to remove all of ID 2 (rows 3 and 4) since row 3 does not have a match between match1 and match2. All I can figure out how to do so far is to delete the rows where they dont match, which isnt terribly helpful for this application. I assume it would be easier to make it a new data set with some wheres but I am unsure how to begin there. Any ideas / advice?
EDIT:
Apologies, I dumbed down my dataset too much and forgot about an important exception. Note in my new dataset (I only added one row to the end). I do NOT want to delete group 3, since match2 is blank. I only want to delete a group where match2 is not blank and match1 does not equal match2.
Thanks
There's a few ways to do this. One would be to just construct a dataset of IDs that have non-matching rows, then do a merge or a SQL join and remove anything that matched this list.
However, my preferred option (partly because of speed, but also it's more straightforward once you understand how it works) is the DoW loop.
data want;
id_nonmatch = 0;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if match1 ne match2 then id_nonmatch = 1; *set the flag to 1 if we find a nonmatch;
end;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if id_nonmatch = 0 then output;
end;
run;
There are two set statements on the data step, each of which runs through the same dataset separately. If it doesn't make sense, throw a put _all_; inside each of the do loops - that will show you what it's doing. The first loop goes over all of the rows for one ID, checks if any violate the constraint, and if none do, the flag variable (id_nonmatch) stays 0. If one does, it becomes a 1 (and stays that way). Then, when it hits an ID boundary, it stops pulling records from the first set statement, and goes onto the second - re-pulling those same rows. Now, it outputs only when the flag is a zero.
This is very efficient because of buffering - unless your id groups are very large, the data step may be able to use buffers to keep the same rows in memory and not have to reread them from disk. (This will depend on your disk and buffers - and seems to help much less on flash than on physical disks [since there is not the additional benefit of the disk head not having to move] - so your mileage may vary here.)
Just to show this difference, here is a log showing that there isn't much additional time needed for the second read - when the record is reasonably sized. This benefit is less when the record is very small - I imagine there is more overhead involved. Note that the second read adds only 1/7 of the time of the first read to the total processing time!
69 data have;
70 call streaminit(7);
71 length strvar $1000;
72 do id = 1 to 100000;
73 do iter = 1 to 50;
74 x = rand('Uniform');
75 output;
76 end;
77 end;
78 run;
NOTE: Variable strvar is uninitialized.
NOTE: The data set WORK.HAVE has 5000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 5.20 seconds
cpu time 5.20 seconds
79
80
81 data _null_;
82 do _n_ = 1 by 1 until (last.id);
83 set have;
84 by id;
85 end;
86 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.37 seconds
cpu time 2.37 seconds
87
88
89 data _null_;
90 do _n_ = 1 by 1 until (last.id);
91 set have;
92 by id;
93 end;
94 do _n_ = 1 by 1 until (last.id);
95 set have;
96 by id;
97 end;
98 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.74 seconds
cpu time 2.73 seconds
It is easy to do this with an SQL query with a GROUP BY and HAVING clause.
proc sql;
create table want as
select *
from have
group by id
having max( (match1 ne match2) and not missing(match2))
;
quit;
SAS evaluates boolean expressions as 1/0 for TRUE/FALSE so the MAX() of a series of TRUE/FALSE values will be TRUE if ANY of them are TRUE.
So maybe I'm just way over-thinking things, but is there any way to replicate a nested/loop calculation in Vertica with just SQL syntax.
Explanation -
In Column AP I have remaining values per month by an attribute key, in column CHANGE_1M I have an attribution value to apply.
The goal is for future values to calculate the preceding Row partition AP*CHANGE_1M, by the subsequent row partition CHANGE_1M to fill in the future AP values.
For reference I have 15,000 Keys Per Period and 60 Periods Per Year in the full-data set.
Sample Calculation
Period 5 =
(Period4_AP * Period5_CHANGE_1M)+Period4_AP
Period 6 =
(((Period4_AP * Period5_CHANGE_1M)+Period4_AP)*Period6_CHANGE_1M)
+
((Period4_AP * Period5_CHANGE_1M)+Period4_AP)
ect.
Sample Data on Top
Expected Results below
Vertica does not have (yet?) the RECURSIVE WITH clause, which you would need for the recursive calculation you seem to be needing here.
Only possible workaround would be tedious: write (or generate, using perl or Python, for example) as many nested queries as you need iterations.
I'll only want to detail this if you want to go down that path.
Long time no see - I should have returned to answer this question earlier.
I got so stuck on thinking of the programmatic way to solve this issue, I inherently forgot it is a math equation, and where you have math functions you have solutions.
Basically this question revolves around doing table multiplication.
The solution is to simply use LOG/LN functions to multiply and convert back using EXP.
Snippet of the simple solve.
Hope this helps other lost souls, don't forget your math background and spiral into a whirlpool of self-defeat.
EXP(SUM(LN(DEGREDATION)) OVER (ORDER BY PERIOD_NUMBER ASC ROWS UNBOUNDED PRECEDING)) AS DEGREDATION_RATE
** Controlled by what factors/attributes you need the data stratified by with a PARTITION
Basically instead of starting at the retention PX/P0, I back into with the degradation P1/P0 - P2/P1 ect.
PERIOD_NUMBER
DEGRADATION
DEGREDATION_RATE
DEGREDATION_RATE x 100000
0
100.00%
100.00%
100000.00
1
57.72%
57.72%
57715.18
2
60.71%
35.04%
35036.59
3
70.84%
24.82%
24820.66
4
76.59%
19.01%
19009.17
5
79.29%
15.07%
15071.79
6
83.27%
12.55%
12550.59
7
82.08%
10.30%
10301.94
8
86.49%
8.91%
8910.59
9
89.60%
7.98%
7984.24
10
86.03%
6.87%
6868.79
11
86.00%
5.91%
5907.16
12
90.52%
5.35%
5347.00
13
91.89%
4.91%
4913.46
14
89.86%
4.41%
4414.99
15
91.96%
4.06%
4060.22
16
89.36%
3.63%
3628.28
17
90.63%
3.29%
3288.13
18
92.45%
3.04%
3039.97
19
94.95%
2.89%
2886.43
20
92.31%
2.66%
2664.40
21
92.11%
2.45%
2454.05
22
93.94%
2.31%
2305.32
23
89.66%
2.07%
2066.84
24
94.12%
1.95%
1945.26
25
95.83%
1.86%
1864.21
26
92.31%
1.72%
1720.81
27
96.97%
1.67%
1668.66
28
90.32%
1.51%
1507.18
29
90.00%
1.36%
1356.46
30
94.44%
1.28%
1281.10
31
94.12%
1.21%
1205.74
32
100.00%
1.21%
1205.74
33
90.91%
1.10%
1096.13
34
90.00%
0.99%
986.52
35
94.44%
0.93%
931.71
36
100.00%
0.93%
931.71
I need to compare the last row of a group to the row above it, see if changes occur in a few columns, and populate a new column with 1 if a change occurs. The data presentation below will explain better.
Also need to account for having a group with only 1 row.
what we have:
Group Name Sport DogName Eligibility
1 Tom BBALL Toto Yes
1 Tom BBall Toto Yes
1 Tom golf spot Yes
2 Nancy vllyball Jimmy yes
2 Nancy vllyball rover no
what we want:
Group Name Sport DogName Eligibility N_change S_change D_Change E_change
1 Tom BBALL Toto Yes 0 0 0 0
1 Tom BBall Toto Yes 0 0 0 0
1 Tom golf spot Yes 0 1 1 0
2 Nancy vllyball Jimmy yes 0 0 0 0
2 Nancy vllyball rover no 0 0 1 1
Only care about changes from row to row within group. Thank you for any help in advance.
The rows are already ordered so we only need to last two of the group. If it is easier to compare sequential rows in a group then that is just as good for my purposes.
I did know this would be arrays and I struggle with these because never use them for my typical sas modeling. Wanted to keep things short and sweet.
Use the data step and lag statements. Ensure your data is sorted by group first, and that the rows within groups are sorted in the correct order. Using arrays will make your code much smaller.
The logic below will compare each row with the previous row. A flag of 1 will be set only if:
It's not the first row of the group
The current value differs from the previous value.
The syntax var = (test logic); is a shortcut to automatically generate dummy flags.
data want;
set have;
by group;
array var[*] name sport dogname eligibility;
array lagvar[*] $ lag_name lag_sport lag_dogname lag_eligibility;
array changeflag[*] N_change S_change D_change E_change;
do i = 1 to dim(var);
lagvar[i] = lag(var[i]);
changeflag[i] = (var[i] NE lagvar[i] AND NOT first.group);
end;
drop lag: i;
run;
It is not uncommon for procedural programmers to find this kind of dilemma in SQL, which is predominately a set language where rows have no position. If you write a procedure that reads the select data (sorted in the desired order), it can have variables to control creating the desired additional columns in the output, similar to the lag function above.
Or you can put it into a spreadsheet, which is happier detecting the changes in formula filled columns =if(a2<>a1,1,0). Just make sure nobody re-sorts the spreadsheet data into a new order!
Good morning.
I've this dataset:
Appendix | Change_Serial_Number| Status | Duration | Mileage | Service
20101234 0 . 60 120000 Z
20101234 1 Proposed 48 110000 Z
20101234 2 Activated 24 90000 Z
20101234 3 Proposed 60 120000 Z
20101234 4 Proposed 50 160000 B
20101234 5 Activated 36 110000 B
Each row is a variation that could be activated or only proposed with the first row with status like blank or the previously activated variation.
I need to have this table:
Appendix | Change_Serial_Number| Status | Duration | Mileage | Service |Duration_Prev| Mileage_prev |
20101234 0 . 60 120000 Z .
20101234 1 Proposed 48 110000 Z 60 120000
20101234 2 Activated 24 90000 Z 60 120000
20101234 3 Proposed 60 120000 Z 24 90000
20101234 4 Proposed 50 160000 B 24 90000
20101234 5 Activated 36 110000 B 24 90000
I need to compare the duration, mileage and service of each variation with the previously activated or with the initial condition only if there aren't variation activated.
I tried with lag function to retrieve a data of previous row, but i need to retrieve data of 3 field and retrieve data only from the last activated variation or, if there aren't, from the initial condition.
I used this code:
proc sort data=db_rdg;
by Appendix Change_Serial_Number descending Change_Serial_Number;
run;
data db_rdg2;
set db_rdg;
by Appendix;
Duration_prev=lag(Duration);
if first. Appendix then Durata_prev =.;
run;
With this code, i can retrieve a data only from the previously row (not from the previosly actived row or from the first condition) and only for a duration variable (not at the same time for duration, mileage and service).
I hope I have been clear enough :)
Thank you for your help!
The lag() function is only really useful for working with values from a specific number of observations earlier. In this case, you don't know whether the values you want to work with are from the previous observation or from five or six observations earlier, so instead of using lag(), you should RETAIN the additional variables and update their values when appropriate:
data db_rdg2;
retain duration_prev .;
set db_rdg;
by Appendix;
if first.Appendix or status = 'Activated' then duration_prev = duration;
run;
The RETAIN statement allows duration_prev to retain its value as each new observation in read from the input, instead of being reset to missing.
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214163.htm
Instead of using LAG to retrieve the duration from the prior row, you will want to store the activate state tracking variables (for duration, mileage and serial) in a variable that is retained and updated after an explicit output.
In these two sample codes I tossed in tracking serial as you may want to know # of changes from prior activate.
data have; input
Appendix Change_Serial_Number Status $ Duration Mileage Service $;
datalines;
20101234 0 . 60 120000 Z
20101234 1 Proposed 48 110000 Z
20101234 2 Activated 24 90000 Z
20101234 3 Proposed 60 120000 Z
20101234 4 Proposed 50 160000 B
20101234 5 Activated 36 110000 B
run;
* NOTE: _APA suffix means # prior activate;
* version 1;
* implicit loop with by group processing means ;
* explicit first. test needed in order to reset the apa tracking variables;
data want;
set have;
by appendix;
if first.appendix then do;
length csn_apa dur_apa mil_apa 8;
call missing(csn_apa, dur_apa, mil_apa);
end;
output;
if status in (' ' 'Activate') then do;
csn_apa = change_serial_number;
dur_apa = duration;
mil_apa = mileage;
end;
retain csn_apa dur_apa mil_apa;
run;
* version 2;
* DOW version;
* explicit loop over group means first. handling not explicitly needed;
* implicit loop performs tracking variable resets;
* retain not needed because output and tracking variables modified;
* within current iteration of implicit loop;
data want2;
do until (last.appendix);
set have;
by appendix;
output;
if status in (' ' 'Activate') then do;
csn_apa = change_serial_number;
dur_apa = duration;
mil_apa = mileage;
end;
end;
run;
I wasnt able to find anything like this yet... but here is what i need to do:
I have a query result like this:
ID Data1 Data2 Data3 Data4 ... Data7
1 12 13 15 1 ... 12
2 12 13 15 1 ... 12
3 12 13 15 1 ... 12
4 12 13 15 1 ... 12
I need to make a BarChart With 2 Values, 1 is the first row (ID=1) one is the last row (ID=4). The column headers DataX is what i need the series to be paired by.
Example:
ID Insured Uninsured Rejected
1 12 3 0
4 16 9 2
In the BarChart i need to see the number of insured or ID=1 and ID=2 next to each other, the number of Uninsured and rejected the same.
I feel like i have tried all ways possible but was not able to get anything besides a BarChart where all values of ID=1 where displayed and then all values for ID=2 where displayed next to each other.
Im sure this was a very confusing way to describe it, but i hope someone can understand what i am looking for.
NOTE: I tried to do this in Excel, and it worked within 2 minutes. I set the filter: Series on the 2 rows that i wanted, and set the Categories to the dataX Columns as described, and everything looked great. When i tried to translate this into SSRS i was able to do all the same things in the Series and Categories, but then i had to put in values and that screwed everything up.
PLEASE HELP!
I bet you need to add a grouping to your values by a spanning factor.