SAS - Advanced querying - sql

I have one SQL table with data in SAS. The first column is a datetime, and there is one row for each second; the set spans about 20 minutes. The other columns contain integer values.
Here is what I need:
For example, let's pick 50: how many times did the integer value go from below 50 to above 50 and stay above 50 for at least n seconds?
Is it possible to conduct such an analysis with PROC SQL? If yes, how; and if not, what else would work?
I am new to SAS, so any help is appreciated. Let me know if you need more info!
Thanks!

How many times did the integer value go from below/above 50
I think this could be a solution to the first part of the question. The crossing is probably best detected by comparing the current value with the prior one:
data begin; /*Some test data...*/
input int_in_question;
datalines;
51
51
49
55
55
40
40
60
40
;
run;
data With_calc;
set begin;
prior = lag(int_in_question); /* value from the previous row (second) */
/* count drops below 50; swap the comparisons to count rises above 50 instead */
if int_in_question < 50 and prior >= 50 then Times_below_50 + 1;
run;
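For the full requirement (went from below 50 to above 50 and stayed there for at least n seconds), PROC SQL is awkward because it has no LAG or window functions, so a DATA step is the natural tool. A rough sketch, reusing the test data and variable name above and assuming one row per second; the value of n, the output dataset name (crossings) and the counter names (run_length, count) are made up for illustration, and 50 itself is treated as "above":
%let threshold = 50;
%let n = 5; /* minimum number of consecutive seconds above the threshold */

data crossings;
    set begin end = eof;
    prior = lag(int_in_question);   /* value from the previous second */
    retain run_length 0 count 0;
    if int_in_question >= &threshold then do;
        /* a new run starts only when the previous second was below the threshold */
        if prior ne . and prior < &threshold then run_length = 1;
        else if run_length > 0 then run_length + 1;
        /* credit the crossing the moment the run reaches n seconds */
        if run_length = &n then count + 1;
    end;
    else run_length = 0;
    if eof then put "Crossings that stayed above &threshold for at least &n seconds: " count;
run;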

Related

SAS delete and group by

A simplified version of the dataset I have is:
DATA HAVE;
INPUT ID match1 $ match2 $ not_relevant;
DATALINES;
1 "ABC" "ABC" 4
1 "XYZ" "XYZ" 29
2 "QQQ" "AAA" 5
2 "ABC" "ABC" 9
3 "EFG" "EFG" 7
3 "DEF" "DEF" 12
3 "LMK" LMK" 16
3 "LMK" . 29
;RUN;
I am looking to compare match1 and match2, and if anywhere within an ID group match1 does not equal match2, I would like to remove all of the rows with that ID. So for this example dataset I want to remove all of ID 2 (rows 3 and 4), since row 3 does not have a match between match1 and match2. All I can figure out how to do so far is to delete the individual rows where they don't match, which isn't terribly helpful for this application. I assume it would be easier to make a new data set with some WHERE conditions, but I am unsure how to begin there. Any ideas / advice?
EDIT:
Apologies, I dumbed down my dataset too much and forgot about an important exception - note the new dataset above (I only added one row to the end). I do NOT want to delete group 3, since its match2 is blank. I only want to delete a group where match2 is not blank and match1 does not equal match2.
Thanks
There are a few ways to do this. One would be to construct a dataset of the IDs that have non-matching rows, then do a merge or a SQL join and remove anything that matches this list.
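A minimal sketch of that join approach in PROC SQL, using the names from the question (the not missing(match2) test covers the edit about blank match2):
proc sql;
    /* IDs that have at least one genuine mismatch */
    create table bad_ids as
    select distinct id
    from have
    where match1 ne match2 and not missing(match2);

    /* keep only the IDs that are not on that list */
    create table want as
    select *
    from have
    where id not in (select id from bad_ids);
quit;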
However, my preferred option (partly because of speed, but also because it's more straightforward once you understand how it works) is the DoW loop.
data want;
    id_nonmatch = 0;
    do _n_ = 1 by 1 until (last.id);
        set have;
        by id;
        /* set the flag to 1 if we find a genuine nonmatch (blank match2 doesn't count) */
        if match1 ne match2 and not missing(match2) then id_nonmatch = 1;
    end;
    do _n_ = 1 by 1 until (last.id);
        set have;
        by id;
        if id_nonmatch = 0 then output;
    end;
run;
There are two set statements on the data step, each of which runs through the same dataset separately. If it doesn't make sense, throw a put _all_; inside each of the do loops - that will show you what it's doing. The first loop goes over all of the rows for one ID, checks if any violate the constraint, and if none do, the flag variable (id_nonmatch) stays 0. If one does, it becomes a 1 (and stays that way). Then, when it hits an ID boundary, it stops pulling records from the first set statement, and goes onto the second - re-pulling those same rows. Now, it outputs only when the flag is a zero.
This is very efficient because of buffering - unless your id groups are very large, the data step may be able to use buffers to keep the same rows in memory and not have to reread them from disk. (This will depend on your disk and buffers - and seems to help much less on flash than on physical disks [since there is not the additional benefit of the disk head not having to move] - so your mileage may vary here.)
Just to show this difference, here is a log showing that there isn't much additional time needed for the second read - when the record is reasonably sized. This benefit is less when the record is very small - I imagine there is more overhead involved. Note that the second read adds only 1/7 of the time of the first read to the total processing time!
69 data have;
70 call streaminit(7);
71 length strvar $1000;
72 do id = 1 to 100000;
73 do iter = 1 to 50;
74 x = rand('Uniform');
75 output;
76 end;
77 end;
78 run;
NOTE: Variable strvar is uninitialized.
NOTE: The data set WORK.HAVE has 5000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 5.20 seconds
cpu time 5.20 seconds
79
80
81 data _null_;
82 do _n_ = 1 by 1 until (last.id);
83 set have;
84 by id;
85 end;
86 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.37 seconds
cpu time 2.37 seconds
87
88
89 data _null_;
90 do _n_ = 1 by 1 until (last.id);
91 set have;
92 by id;
93 end;
94 do _n_ = 1 by 1 until (last.id);
95 set have;
96 by id;
97 end;
98 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.74 seconds
cpu time 2.73 seconds
It is easy to do this with an SQL query with a GROUP BY and HAVING clause.
proc sql;
create table want as
select *
from have
group by id
having max( (match1 ne match2) and not missing(match2) ) = 0
;
quit;
SAS evaluates boolean expressions as 1/0 for TRUE/FALSE, so the MAX() of a series of TRUE/FALSE values will be TRUE if ANY of them are TRUE; comparing that MAX() to zero keeps only the ID groups with no genuine mismatch.

Creating a Nested/Loop Calculation in Vertica (?)

So maybe I'm just way over-thinking things, but is there any way to replicate a nested/loop calculation in Vertica with just SQL syntax?
Explanation -
In column AP I have remaining values per month by an attribute key; in column CHANGE_1M I have an attribution value to apply.
The goal is to fill in the future AP values: within each key partition, each future period's AP is the preceding period's AP multiplied by the current period's CHANGE_1M, plus the preceding period's AP.
For reference, I have 15,000 keys per period and 60 periods per year in the full data set.
Sample Calculation
Period 5 = (Period4_AP * Period5_CHANGE_1M) + Period4_AP
Period 6 = (((Period4_AP * Period5_CHANGE_1M) + Period4_AP) * Period6_CHANGE_1M) + ((Period4_AP * Period5_CHANGE_1M) + Period4_AP)
etc.
(Sample data and expected results were shown as screenshots in the original post.)
Vertica does not have (yet?) the RECURSIVE WITH clause, which you would need for the recursive calculation you seem to need here.
The only possible workaround would be tedious: write (or generate, using Perl or Python, for example) as many nested queries as you need iterations.
I'll only detail this further if you want to go down that path.
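To give a flavour of what one generated step could look like - the table and column names (ap_data, key_id, period_number, ap, change_1m) are assumptions - filling period 5 from the known period 4 values might be:
-- Period5_AP = (Period4_AP * Period5_CHANGE_1M) + Period4_AP
SELECT cur.key_id,
       cur.period_number,
       (prev.ap * cur.change_1m) + prev.ap AS ap
FROM ap_data cur
JOIN ap_data prev
  ON  prev.key_id        = cur.key_id
  AND prev.period_number = cur.period_number - 1
WHERE cur.period_number = 5;
-- ...and you would repeat or nest this pattern once per future period, which is the tedious part.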
Long time no see - I should have returned to answer this question earlier.
I got so stuck thinking of a programmatic way to solve this issue that I forgot it is a math equation, and where you have math functions you have solutions.
Basically, this question revolves around computing a running product down the table.
The solution is simply to use LN to turn the multiplication into a running SUM, and then convert back using EXP.
Here is a snippet of the simple solve; hope it helps other lost souls - don't forget your math background and spiral into a whirlpool of self-defeat.
EXP(SUM(LN(DEGREDATION)) OVER (ORDER BY PERIOD_NUMBER ASC ROWS UNBOUNDED PRECEDING)) AS DEGREDATION_RATE
** Control which factors/attributes the data is stratified by with a PARTITION BY inside the OVER() clause.
Basically, instead of starting from the retention PX/P0, I back into it using the period-over-period degradation P1/P0, P2/P1, etc.
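To show how the snippet sits in a full query - the table and column names (retention, period_number, degradation) are assumptions, and the degradation factor is assumed to be stored as a fraction (0.5772 rather than 57.72):
SELECT period_number,
       degradation,
       /* running product of the period-over-period factors, written as exp(sum(ln(x)))
          because SQL has a windowed SUM() but no windowed PRODUCT() */
       EXP(SUM(LN(degradation))
           OVER (ORDER BY period_number ASC ROWS UNBOUNDED PRECEDING)) AS degredation_rate
FROM retention
ORDER BY period_number;
-- add PARTITION BY <attribute key> inside OVER() to stratify by key, as noted above.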
PERIOD_NUMBER   DEGRADATION   DEGREDATION_RATE   DEGREDATION_RATE x 100000
0               100.00%       100.00%            100000.00
1               57.72%        57.72%             57715.18
2               60.71%        35.04%             35036.59
3               70.84%        24.82%             24820.66
4               76.59%        19.01%             19009.17
5               79.29%        15.07%             15071.79
6               83.27%        12.55%             12550.59
7               82.08%        10.30%             10301.94
8               86.49%        8.91%              8910.59
9               89.60%        7.98%              7984.24
10              86.03%        6.87%              6868.79
11              86.00%        5.91%              5907.16
12              90.52%        5.35%              5347.00
13              91.89%        4.91%              4913.46
14              89.86%        4.41%              4414.99
15              91.96%        4.06%              4060.22
16              89.36%        3.63%              3628.28
17              90.63%        3.29%              3288.13
18              92.45%        3.04%              3039.97
19              94.95%        2.89%              2886.43
20              92.31%        2.66%              2664.40
21              92.11%        2.45%              2454.05
22              93.94%        2.31%              2305.32
23              89.66%        2.07%              2066.84
24              94.12%        1.95%              1945.26
25              95.83%        1.86%              1864.21
26              92.31%        1.72%              1720.81
27              96.97%        1.67%              1668.66
28              90.32%        1.51%              1507.18
29              90.00%        1.36%              1356.46
30              94.44%        1.28%              1281.10
31              94.12%        1.21%              1205.74
32              100.00%       1.21%              1205.74
33              90.91%        1.10%              1096.13
34              90.00%        0.99%              986.52
35              94.44%        0.93%              931.71
36              100.00%       0.93%              931.71

SPSS Compute Variable

Below is some data:
Test Day1 Day2 Score
A 1 2 100
B 1 3 62
C 3 4 90
D 2 4 20
E 4 5 80
I am trying to take the values from the columns 'Day1' and 'Day2' and use them to select row numbers in the Score column. For example, for Test A I would like to find the sum of 100 and 62, because those are the values of the first and second rows of Score. For Test B I would like to find the sum of 100, 62 and 90.
Is there any way to do this in the Compute Variable window, found in the menu Transform > Compute Variable?
I tried the following:
Score(MEAN(VALUE(Day1), VALUE(DAY2)))
This is not the proper way to call the cell location of Score and I received an error.
Can anyone help?
Thank you!
You really have two different datasets here. One is a dataset of scores numbered 1 through 5.
The other is a dataset that includes indexes into the score dataset. So the steps would be something like this.
First, take the scores dataset and transpose it so that it has one row and 5 columns (Data > Transpose).
Then match that dataset to each case in the main dataset (Data > Merge Files > Add Variables).
Next, you have to resort to using syntax directly.
You would declare a vector for the scores (VECTOR).
Finally, you use COMPUTE to index into the scores.
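Putting those pieces together, a rough sketch of the syntax - assuming the transposed, merged-in score variables are named score1 through score5 (hypothetical names); the LOOP just repeats the COMPUTE for each row number from Day1 through Day2:
VECTOR s = score1 TO score5.
COMPUTE total = 0.
LOOP #i = Day1 TO Day2.
  COMPUTE total = total + s(#i).
END LOOP.
EXECUTE.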
For your real problem, I suppose that you might have batches of scores and maybe there are some gaps. The Restructure Data Wizard can help you generalize this - convert cases into variables, but let's not go there yet.
HTH,
Jon Peck

How to split a really long mysql result set into two lines?

Suppose you have a result that is 100 chars long but you only have a 50 char width. How do you split a MySQL result into two rows of 50 chars each?
Could you clarify the question a bit? Are you looking to insert 100 chars of data into a 50 char column? Or do you have 100 chars in the database but only have space in your app to display 50 chars?
I have 100 chars in the database result set but I want the result set string to have a break after the 50th char and continue onto the next line.
Example
SELECT * FROM FOO
returns
1 2 3 4 5 6 7 8 9...50 51 52 53..98 99 100
but I want
1 2 3 4 5 6 7 8 9...50
51 52... 99 100
Is this possible?
SELECT substring(col, 1, 50) FROM foo
UNION ALL
SELECT substring(col, 51) FROM foo
You're asking a question about formatting data for viewing. SQL is a declarative data retrieval language, not a data pretty-formatting language. You should solve this problem in your non-SQL code.
Formatting data in a SQL query is not a good idea, unless you have to write something that will run in a query analyzer. Your question isn't specific about whether or not that is the case.
Do you want to return the result set in PHP or MySQL? If the former, then it's easier.
Take the string, take the first 50 characters, put in a line break, and then append the rest of the string.
MySQL would work on the same principle, but you may have issues with line-break characters.
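If you do want the break to come back from MySQL itself, a sketch (assuming a table foo with a 100-character column col):
SELECT CONCAT(SUBSTRING(col, 1, 50), '\n', SUBSTRING(col, 51)) AS wrapped
FROM foo;
-- if backslash escapes are disabled (NO_BACKSLASH_ESCAPES), use CHAR(10) instead of '\n'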

how to find Sum(field) in condition ie "select * from table where sum(field) < 150"

I have to retrieve only a set of records whose combined value in the size field is <= 150.
I have a table like the one below...
userid size
1 70
2 100
3 50
4 25
5 120
6 90
The output should be ...
userid size
1 70
3 50
4 25
For example, if we add 70, 50 and 25 we get 145, which is <= 150.
How would I write a query to accomplish this?
Here's a query which will produce the above results:
SELECT * FROM `users` u
/* the subquery is the running total of every row whose size is <= this row's size */
WHERE (select sum(size) from `users` where size <= u.size) < 150
ORDER BY userid
However, the problem you describe - selecting the users that most closely fit into a given size - is a bin packing problem. That is NP-hard, and it won't be easily solved with ANSI SQL. The query above seems to return the right result, but in fact it simply starts with the smallest item and continues to add items until the bin is full.
A more effective general bin packing heuristic is to start with the largest item and continue to add smaller ones as they fit. That approach would select users 5 and 4.
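For what it's worth, on databases that support window functions (MySQL 8+, PostgreSQL, and others) the same smallest-first running total can be written without the correlated subquery; a sketch against the table above:
SELECT userid, size
FROM (
    SELECT userid,
           size,
           SUM(size) OVER (ORDER BY size, userid) AS running_total  -- cumulative size, smallest rows first
    FROM users
) t
WHERE running_total <= 150
ORDER BY userid;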
What you're looking for is a greedy algorithm. You can't really do this with one SQL statement.
It's similar to the subset sum problem. You are definitely going to be into exponential time...
There are several ways to solve subset sum in time exponential in N. The most naïve algorithm would be to cycle through all subsets of N numbers and, for every one of them, check if the subset sums to the right number. The running time is of order O(2^N * N), since there are 2^N subsets and, to check each subset, we need to sum at most N elements.
Unless you can constrain the problem to smaller subsets.
According to your definition as it stands you could get any of these tables:
userid size    userid size
1      70      2      100

userid size    userid size
3      50      4      25

userid size    userid size
5      120     6      90

userid size    userid size
1      70      2      100
3      50      3      50

userid size    userid size
1      70      2      100
4      25      4      25

userid size    userid size
1      70      4      25
3      50      6      90
4      25

userid size    userid size
4      25      3      50
5      120     6      90
SQL sucks at guessing. Do you mean to say you want the most users whose total size is under a certain limit? You'll need to create a temp table of all the combinations of users, then select the ones whose total size is less than the limit, then select the one with the most users, and possibly the lowest user ID or something. Either way, it won't be fast, due to the first step.
But do you want to maximize the number of results, minimize it, or do you simply not care? The first two cases are constrained optimization problems, for which an SQL solution should exist; the latter (as mentioned above) requires a greedy strategy.