Sum over rows split between two columns

My data looks like this, and I can't figure out how to obtain the column "want". I've tried various combinations of the retain, lag and sum functions, unfortunately with no success.
month quantity1 quantity2 want
1 a x x+sum(b to l)
2 b y sum(x to y)+sum(c to l)
3 c z sum(x to z)+sum(d to l)
4 d
5 e
6 f
7 g
8 h
9 i
10 j
11 k
12 l
Thank you for any help on this matter

It is convenient to sum quantity1 first and store the total in a macro variable. Using superfluous' example data:
proc sql;
select sum(qty1) into:sum_qty1 from temp;
quit;
data want;
set temp;
value1+qty1;
value2+qty2;
want=value2+&sum_qty1-value1;
if missing(qty2) then want=.;
drop value:;
run;
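The arithmetic behind this step can be checked outside SAS. Below is a minimal Python sketch, assuming the same sample data as the temp data set (qty1 is always 1, qty2 is 100 for the first three months and missing afterwards). The function name rolling_want is hypothetical and only illustrates the identity want = running sum of qty2 + total of qty1 - running sum of qty1.

```python
from itertools import accumulate

def rolling_want(qty1, qty2):
    """Return the 'want' column; qty2 entries may be None (missing)."""
    total1 = sum(qty1)                              # plays the role of &sum_qty1
    run1 = list(accumulate(qty1))                   # value1 + qty1
    run2 = list(accumulate(q or 0 for q in qty2))   # value2 + qty2, missing counts as 0
    return [
        None if q2 is None else r2 + total1 - r1    # want is missing where qty2 is missing
        for q2, r1, r2 in zip(qty2, run1, run2)
    ]

qty1 = [1] * 12
qty2 = [100, 100, 100] + [None] * 9
print(rolling_want(qty1, qty2))
```

For month 1 this gives 100 + 12 - 1 = 111, which is x plus the sum of b through l, as the question asks.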

You may be able to do this in one step, but the following produces the desired result in two. The first step is to calculate the sum of the relevant quantity1 values, and the second is to add them to the sum of the relevant quantity2 values:
data temp;
input month qty1 qty2;
datalines;
1 1 100
2 1 100
3 1 100
4 1 .
5 1 .
6 1 .
7 1 .
8 1 .
9 1 .
10 1 .
11 1 .
12 1 .
;
run;
proc sql;
create table qty1_sums as select distinct
a.*, sum(b.qty1) as qty1_sums
from temp as a
left join temp as b
on a.month < b.month
group by a.month;
create table want as select distinct
a.*,
case when not missing(a.qty2) then sum(a.qty1_sums, sum(b.qty2)) end as want
from qty1_sums as a
left join temp as b
on a.month >= b.month
group by a.month;
quit;

Sounds like a 'rolling 12 months sum'. If so, much easier to do with a different data structure (not 2 variables, but 24 rows 1 variable); then you have all of the ETS tools, or a simple process in either SQL or SAS data step.
If you can't/won't restructure your data, then you can do this by loading the data into temporary arrays (or hash table but arrays are simpler for a novice). That gives you access to the whole thing right up front. Example:
data have;
do month = 1 to 12;
q_2014 = rand('Uniform')*10+500+month*5;
q_2015 = rand('Uniform')*10+550+month*5;
output;
end;
run;
data want;
array q2014[12] _temporary_; *temporary array to hold data;
array q2015[12] _temporary_;
if _n_=1 then do; *load data into temporary arrays;
do _n = 1 to n_data;
set have point=_n nobs=n_data;
q2014[_n] = q_2014;
q2015[_n] = q_2015;
end;
end;
set have;
do _i = 1 to _n_; *grab the this_year data;
q_rolling12 = sum(q_rolling12,q2015[_i]);
end;
do _j = _n_+1 to n_data;
q_rolling12 = sum(q_rolling12,q2014[_j]);*grab the last_year data;
end;
run;

How to add a sum column with respect to a group by

I have a table1 :
ZP age Sexe Count
A 40 0 5
A 40 1 3
C 55 1 2
And I want to add a column which sums the count column, grouping by the first two variables:
ZP age Sexe Count Sum
A 40 0 5 8
A 40 1 3 8
C 55 1 2 2
This is what I do:
CREATE TABLE table2 AS SELECT zp, age, SUM(count) FROM table1 GROUP BY zp, age
then :
CREATE TABLE table3 AS SELECT * FROM table1 NATURAL JOIN table2
But I have a feeling this is a sloppy way to do it. Do you know any better ways? For example, one with no intermediate tables.
Edit: I am using SQL through PROC SQL in SAS.
I'm not quite sure if there is a method for a single select statement but below will work without multiple create table statements:
data have;
length ZP $3 age 3 Sexe $3 Count 3;
input ZP $ age Sexe $ Count;
datalines;
A 40 0 5
A 40 1 3
C 55 1 2
;
run;
proc sql noprint;
create table WANT as
select a.*, b.SUM
from
(select * from HAVE) a,
(select ZP, age, sum(COUNT) as SUM from HAVE group by ZP, age) b
where a.ZP = b.ZP and a.age = b.age;
quit;
PROC SQL does not support enhanced SQL features like PARTITION.
But it looks like you want to include summarized data and detail rows at the same time? If so, PROC SQL will do that for you automatically. If your select list includes variables that are neither group by variables nor summary statistics, SAS automatically remerges the summary statistics with the detail rows to produce the table you want.
proc sql;
SELECT zp, age, sexe, count, SUM(count)
FROM table1
group by zp, age
;
quit;
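What this remerging amounts to can be sketched in plain Python: one pass to total the counts per (zp, age) group, and a second pass to attach each total to every detail row. This only illustrates the idea and is not how SAS implements it; the function name add_group_sum and the dictionary layout are assumptions.

```python
from collections import defaultdict

def add_group_sum(rows, keys, value):
    """Attach the per-group total of `value` to every detail row."""
    totals = defaultdict(int)
    for row in rows:                                   # pass 1: summarize
        totals[tuple(row[k] for k in keys)] += row[value]
    return [                                           # pass 2: remerge onto detail rows
        {**row, "sum": totals[tuple(row[k] for k in keys)]}
        for row in rows
    ]

table1 = [
    {"zp": "A", "age": 40, "sexe": 0, "count": 5},
    {"zp": "A", "age": 40, "sexe": 1, "count": 3},
    {"zp": "C", "age": 55, "sexe": 1, "count": 2},
]
for row in add_group_sum(table1, ("zp", "age"), "count"):
    print(row)
```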
You can use SUM as follows with standard SQL:2003 syntax (I don't know if SAS accepts it):
SELECT zp, age, sexe, count, SUM(count) OVER (PARTITION BY zp, age)
FROM table1;
data have;
input ZP $ age Sexe Count;
datalines;
A 40 0 5
A 40 1 3
C 55 1 2
;
run;
proc sql;
create table want as select
*, sum(count) as sum
from have
group by zp, age;
quit;

sas/sql logic needed

I have data with SSN and open date, and I have to determine whether a customer has opened 2 or more accounts within 120 days, based on the open_date field. I know the INTCK/INTNX functions, but they require 2 date fields, and I'm not sure how to apply the same logic to a single field for the same customer. Please suggest.
SSN account Open_date
xyz 000123 12/01/2015
xyz 112344 11/22/2015
xyz 893944 04/05/2016
abc 992343 01/10/2016
abc 999999 03/05/2016
123 111123 07/16/2015
123 445324 10/12/2015
You can use exists or join:
proc sql;
select distinct SSN
from t
where exists (select 1
from t t2
where t2.SSN = t.SSN and
t2.account <> t.account and /* otherwise every row matches itself */
t2.open_date between t.open_date and t.open_date + 120
);
quit;
I'd do it using JOIN :
proc sql;
create table want as
select *
from have
where SSN in
(select a.SSN
from have a
inner join have b
on a.SSN=b.SSN and a.account ne b.account /* do not pair a row with itself */
where intck('day', a.Open_date, b.Open_Date) between 0 and 120)
;
quit;
Just a slightly different solution here: use the dif function to calculate the number of days between consecutive account openings for each customer.
proc sort data=have;
by ssn open_date;
run;
data want;
set have;
by ssn;
days_between_open = dif(open_date);
if first.ssn then days_between_open = .;
*if 0 < days_between_open < 120 then output;
run;
Then you can filter the table above as required. I've left it commented out at this point because you haven't specified how you want your output table.
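The sort-then-difference idea can be sketched in Python as follows. Assumptions: the dates in the question are MM/DD/YYYY, "within 120 days" is taken to include a gap of exactly 120 days (and same-day openings), and the function name customers_with_close_openings is hypothetical. After sorting by SSN and date, only consecutive openings need to be compared, because if any two openings fall within the window, some consecutive pair does too.

```python
from datetime import date
from itertools import groupby

def customers_with_close_openings(accounts, window_days=120):
    """accounts: iterable of (ssn, account, open_date). Returns the set of flagged SSNs."""
    flagged = set()
    ordered = sorted(accounts, key=lambda r: (r[0], r[2]))   # like proc sort by ssn open_date
    for ssn, group in groupby(ordered, key=lambda r: r[0]):
        dates = [r[2] for r in group]
        for prev, cur in zip(dates, dates[1:]):              # like dif(open_date) within an SSN
            if (cur - prev).days <= window_days:
                flagged.add(ssn)
                break
    return flagged

have = [
    ("xyz", "000123", date(2015, 12, 1)),
    ("xyz", "112344", date(2015, 11, 22)),
    ("xyz", "893944", date(2016, 4, 5)),
    ("abc", "992343", date(2016, 1, 10)),
    ("abc", "999999", date(2016, 3, 5)),
    ("123", "111123", date(2015, 7, 16)),
    ("123", "445324", date(2015, 10, 12)),
]
print(sorted(customers_with_close_openings(have)))
```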

Using a set in the place of a table (or another elegant solution)

I answered a question where I had to generate a temporary derived table on the fly (or use an actual table), see: https://stackoverflow.com/a/24890815/1688441 .
Instead of using the following derived table (using select and union):
(SELECT 21 AS id UNION SELECT 22) AS tmp
within:
SELECT GROUP_CONCAT(CASE WHEN COLUMN1 IS NULL THEN "NULL" ELSE COLUMN1 END)
FROM archive
RIGHT OUTER JOIN
(SELECT 21 AS id UNION SELECT 22) AS tmp ON tmp.id=archive.column2;
I would much prefer to be able to use something much more elegant such as:
([[21],[22]]) AS tmp
Is there any such notation within any of the SQL databases or any similar features? Is there an easy way to use a set in the place of a table in from (when I say set I mean a list of values in 1 dimension) as we use with IN.
So, using such a notation a temporary table with 1 int column, and 1 string column having 2 rows would have:
([[21,'text here'],[22,'text here2']]) AS tmp
SQL Server allows this syntax:
SELECT A, B, C,
CASE WHEN D < 21 THEN ' 0-20'
WHEN D < 51 THEN '21-50'
WHEN D < 101 THEN '51-100'
ELSE '>101' END AS E
,COUNT(*) as "Count"
FROM (
values ('CAR', 1,2,22)
,('CAR', 1,2,23)
,('BIKE',1,3,2)
)TABLE_X(A,B,C,D)
GROUP BY A, B, C,
CASE WHEN D < 21 THEN ' 0-20'
WHEN D < 51 THEN '21-50'
WHEN D < 101 THEN '51-100'
ELSE '>101' END
yielding this:
A    B  C  E      Count
---- -- -- ------ -----
BIKE 1  3   0-20  1
CAR  1  2  21-50  2

Join overlapping date ranges

I need to join table A and table B to create table C.
Table A and Table B store status flags for the IDs. The status flags (A_Flag and B_Flag) can change from time to time, so one ID can have multiple rows, which represent the history of the ID's statuses. The flags for a particular ID can change independently of each other, which can result in one row in Table A corresponding to multiple rows in Table B, and vice versa.
The resulting table (Table C) needs to be a list of unique date ranges covering every date within the ID's life (01/01/2008-18/08/2008), with the A_Flag and B_Flag values for each date range.
The actual tables contain hundreds of IDs, with each ID having a varying number of rows per table.
I have access to SQL and SAS tools to achieve the end result.
Source - Table A
ID Start End A_Flag
1 01/01/2008 23/03/2008 1
1 23/03/2008 15/06/2008 0
1 15/06/2008 18/08/2008 1
Source - Table B
ID Start End B_Flag
1 19/01/2008 17/02/2008 1
1 17/02/2008 15/06/2008 0
1 15/06/2008 18/08/2008 1
Result - Table C
ID Start End A_Flag B_Flag
1 01/01/2008 19/01/2008 1 0
1 19/01/2008 17/02/2008 1 1
1 17/02/2008 23/03/2008 1 0
1 23/03/2008 15/06/2008 0 0
1 15/06/2008 18/08/2008 1 1
I'm going to solve this in SQL, assuming that you have a function called lag (SQL Server 2012, Oracle, Postgres, DB2). You can get the same effect with a correlated subquery.
The idea is to get all the different time periods. Then join back to the original tables to get the flags.
I am having trouble uploading the code, but can get most of it. It starts with startends, which you create by taking a union (not union all) of the four date columns as a single column: select a.start as thedate, unioned with a.end, b.start, and b.end.
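with driver as (
select id, lag(thedate) over (partition by id order by thedate) as startdate, thedate as enddate
from startends -- assumes startends also carries the id column
)
select driver.id, driver.startdate, driver.enddate, a.a_flag, b.b_flag
from driver left outer join
a
on a.id = driver.id and a.start <= driver.startdate and a.end >= driver.enddate left outer join
b
on b.id = driver.id and b.start <= driver.startdate and b.end >= driver.enddate
where driver.startdate is not null -- the first date of each id has no preceding date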
The problem you posed can be solved in one SQL statement without nonstandard extensions.
The most important thing to recognize is that the dates in the begin-end pairs each represent a potential starting or ending point of a time span during which the flag pair will be true. It actually doesn't matter that one date is a "begin" and another an "end"; any date is a time delimiter that does both: it ends a prior period and begins another. Construct a set of minimal time intervals, and join them to the tables to find the flags that obtained during each interval.
I added your example (and a solution) to my Canonical SQL page. See there for a detailed discussion. In fairness to SO, here's the query itself
with D (ID, bound) as (
select ID
, case T when 's' then StartDate else EndDate end as bound
from (
select ID, StartDate, EndDate from so.A
UNION
select ID, StartDate, EndDate from so.B
) as U
cross join (select 's' as T union select 'e') as T
)
select P.*, a.Flag as A_Flag, b.Flag as B_Flag
from (
select s.ID, s.bound as StartDate, min(e.bound) as EndDate
from D as s join D as e
on s.ID = e.ID
and s.bound < e.bound
group by s.ID, s.bound
) as P
left join so.A as a
on P.ID = a.ID
and a.StartDate <= P.StartDate and P.EndDate <= a.EndDate
left join so.B as b
on P.ID = b.ID
and b.StartDate <= P.StartDate and P.EndDate <= b.EndDate
order by P.ID, P.StartDate, P.EndDate
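The same construction can be sketched in Python for a single ID, as an illustration rather than a translation of the query. Every start or end date is treated as a boundary, consecutive boundaries form the minimal intervals, and each interval picks up the flag of the source row that covers it. The name split_intervals and the (start, end, flag) tuple layout are assumptions.

```python
from datetime import date

def split_intervals(a_rows, b_rows):
    """a_rows, b_rows: lists of (start, end, flag) for one ID.
    Returns a list of (start, end, a_flag, b_flag) over the minimal intervals."""
    # Every date is a delimiter, whether it was a start or an end.
    bounds = sorted({d for start, end, _ in a_rows + b_rows for d in (start, end)})

    def flag_for(rows, start, end):
        # Equivalent of the left join: the row whose range covers the interval.
        for row_start, row_end, flag in rows:
            if row_start <= start and end <= row_end:
                return flag
        return None  # no covering row, like NULL from the left join

    return [
        (start, end, flag_for(a_rows, start, end), flag_for(b_rows, start, end))
        for start, end in zip(bounds, bounds[1:])
    ]

table_a = [
    (date(2008, 1, 1), date(2008, 3, 23), 1),
    (date(2008, 3, 23), date(2008, 6, 15), 0),
    (date(2008, 6, 15), date(2008, 8, 18), 1),
]
table_b = [
    (date(2008, 1, 19), date(2008, 2, 17), 1),
    (date(2008, 2, 17), date(2008, 6, 15), 0),
    (date(2008, 6, 15), date(2008, 8, 18), 1),
]
for row in split_intervals(table_a, table_b):
    print(row)
```

Note that for the first interval (01/01/2008 to 19/01/2008) no row of Table B applies, so this sketch yields None for B_Flag where the desired Table C shows 0; mapping a missing flag to 0 would be a separate, final step.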
One possible SAS solution to this is to perform a partial join, and then create the necessary additional rows in the data step. This should work assuming tableA has all possible records; if that's not the case (if tableB can start before tableA), some additional logic may be needed to consider that possibility (if first.id and start gt b_start). There may also be additional logic needed for issues not present in the example data - I don't have a lot of time this morning and didn't debug this for anything beyond the example data cases, but the concept should be evident.
data tableA;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End A_Flag;
datalines;
1 01/01/2008 23/03/2008 1
1 23/03/2008 15/06/2008 0
1 15/06/2008 18/08/2008 1
;;;;
run;
data tableB;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End B_Flag;
datalines;
1 19/01/2008 17/02/2008 1
1 17/02/2008 15/06/2008 0
1 15/06/2008 18/08/2008 1
;;;;
run;
proc sql;
create table c_temp as
select * from tableA A
left join (select id, start as b_start, end as b_end, b_flag from tableB) B
on A.Id = B.id
where (A.start le B.b_start and A.end gt B.b_start) or (A.start lt B.b_end and A.end ge B.b_end)
order by A.ID, A.start, B.b_start;
quit;
data tableC;
set c_temp;
by id start;
retain b_flag_ret;
format start_fin end_fin DATE9.;
if first.id then b_flag_ret=0;
do until (start=end);
if (start lt b_start) and first.start then do;
start_fin=start;
end_fin=b_start;
a_flag_fin=a_flag;
b_flag_fin=b_flag_ret;
output;
start=b_start;
end;
else do; *start=b_start;
start_fin=ifn(start ge b_start, start, b_start);
end_fin = ifn(b_end le end, b_end, end);
a_flag_fin=a_flag;
b_flag_fin=b_flag;
output;
start=end; *leave the loop as there will be a later row that matches;
end;
end;
run;
This type of sequential processing with shifts and offsets is one of the situations where the SAS DATA step shines. Not that this answer is simple, but it is simpler than using SQL, which can be done, but isn't designed with this sequential processing in mind.
Furthermore, solutions based on DATA step tend to be very efficient. This one runs in time O(n log n) in theory, but closer to O(n) in practice, and in constant space.
The first two DATA steps are just loading data, slightly modified from Joe's answer, to have multiple IDs (otherwise the syntax is MUCH easier) and to add some corner cases, i.e., an ID for which it is impossible to determine initial state.
data tableA;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End A_Flag;
datalines;
1 01/01/2008 23/03/2008 1
2 23/03/2008 15/06/2008 0
2 15/06/2008 18/08/2008 1
;;;;
run;
data tableB;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End B_Flag;
datalines;
1 19/01/2008 17/02/2008 1
2 17/02/2008 15/06/2008 0
4 15/06/2008 18/08/2008 1
;;;;
run;
The next data step finds the first modification for each id and flag and sets the initial value to the opposite of what it found.
/* Get initial state by inverting first change */
data firstA;
set tableA;
by id;
if first.id;
A_Flag = ~A_Flag;
run;
data firstB;
set tableB;
by id;
if first.id;
B_Flag = ~B_Flag;
run;
data first;
merge firstA firstB;
by id;
run;
The next data step merges the artificial "first" table with the other two, retaining the last state known and discarding the artificial initial row.
data tableAB (drop=lastA lastB);
set first tableA tableB;
by id start;
retain lastA lastB lastStart;
if A_flag = . and ~first.id then A_flag = lastA;
else lastA = A_flag;
if B_flag = . and ~first.id then B_flag = lastB;
else lastB = B_flag;
if ~first.id; /* drop artificial first row per id */
run;
The steps above do almost everything.
The only bug is that the end dates will be wrong, because they are copied from the original row.
To fix that, copy the next start to each row's end, unless it is a final row.
The easiest way is to sort each id by reverse start, look back one record, then sort ascending again at the end.
/* sort descending to ... */
proc sort data=tableAB;
by id descending start;
run;
/* ... copy next start to this row's "end" field if not final */
data tableAB(drop=nextStart);
set tableAB;
by id descending start;
nextStart=lag(start);
if ~first.id then end=nextStart;
run;
proc sort data=tableAB;
by id start;
run;

MySQL - How to simplify this query?

I have a query which I want to simplify:
select
sequence,
1 added
from scoredtable
where score_timestamp=1292239056000
and sequence
not in (select sequence from scoredtable where score_timestamp=1292238452000)
union
select
sequence,
0 added
from scoredtable
where score_timestamp=1292238452000
and sequence
not in (select sequence from scoredtable where score_timestamp=1292239056000);
Any ideas? Basically, I want to extract from the same table all the sequences that differ between two timestamp values, with a column "added" which indicates whether a row is new or has been deleted.
Source table:
score_timestamp sequence
1292239056000 0
1292239056000 1
1292239056000 2
1292238452000 1
1292238452000 2
1292238452000 3
Example between (1292239056000, 1292238452000)
Query result (2 rows):
sequence added
3 1
0 0
Example between (1292238452000, 1292239056000)
Query result (2 rows):
sequence added
0 1
3 0
Example between (1292239056000, 1292239056000)
Query result (0 rows):
sequence added
This query gets all sequences that appear exactly once across the two timestamps, and checks whether each one occurs at the first or at the second timestamp.
SELECT
sequence,
CASE WHEN MIN(score_timestamp) = 1292239056000 THEN 0 ELSE 1 END AS added
FROM scoredtable
WHERE score_timestamp IN ( 1292239056000, 1292238452000 )
AND ( 1292239056000 <> 1292238452000 ) -- No rows, when timestamp is the same
GROUP BY sequence
HAVING COUNT(*) = 1
It returns your desired result:
sequence added
3 1
0 0
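In set terms, the GROUP BY ... HAVING COUNT(*) = 1 query computes a symmetric difference. Here is a minimal Python sketch of that logic, assuming each (score_timestamp, sequence) pair occurs at most once in the table, as the SQL also assumes. The function name diff_sequences is hypothetical, and the meaning of added follows the CASE expression above: 0 when the sequence exists only at the first timestamp, 1 when it exists only at the second.

```python
def diff_sequences(rows, ts1, ts2):
    """rows: list of (score_timestamp, sequence). Returns sorted (sequence, added) pairs."""
    if ts1 == ts2:                       # same timestamp: nothing can differ
        return []
    first = {seq for ts, seq in rows if ts == ts1}
    second = {seq for ts, seq in rows if ts == ts2}
    result = [(seq, 0) for seq in first - second]    # only at ts1
    result += [(seq, 1) for seq in second - first]   # only at ts2
    return sorted(result)

scoredtable = [
    (1292239056000, 0), (1292239056000, 1), (1292239056000, 2),
    (1292238452000, 1), (1292238452000, 2), (1292238452000, 3),
]
print(diff_sequences(scoredtable, 1292239056000, 1292238452000))
```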
Given two timestamps
SET @ts1 := 1292239056000;
SET @ts2 := 1292238452000;
you can get your additions and deletes with:
SELECT s1.sequence AS sequence, 0 as added
FROM scoredtable s1 LEFT JOIN
scoredtable s2 ON
s2.score_timestamp = @ts2 AND
s1.sequence = s2.sequence
WHERE
s1.score_timestamp = @ts1 AND
s2.score_timestamp IS NULL
UNION ALL
SELECT s2.sequence, 1
FROM scoredtable s1 RIGHT JOIN
scoredtable s2 ON s1.score_timestamp = @ts1 AND
s1.sequence = s2.sequence
WHERE
s2.score_timestamp = @ts2 AND
s1.score_timestamp IS NULL;
Depending on the number of rows and the statistics, the above query might perform better than the GROUP BY and HAVING COUNT(*) = 1 version (I think that one will always need a full table scan, while the above union should be able to do two anti-joins, which might fare better).
If you have a substantial data set, do let us know which is faster (test with SQL_NO_CACHE for comparable results).