How to calculate a time series in Pig - apache-pig

Let's say I write DUMP monthly; I get:
(Jan,2)
(Feb,102)
(Mar,250)
(Apr,450)
(May,590)
(Jun,790)
(Jul,1040)
(Aug,1260)
(Sep,1440)
(Oct,1770)
(Nov,2000)
(Dec,2500)
Checking schema:
DESCRIBE monthly;
Output:
monthly: {group: chararray,total_case: long}
I need to calculate the increase rate for each month. So, for February, it will be:
(total_case in Feb - total_case in Jan) / total_case in Jan = (102 - 2) / 2 = 50
For March it will be: (250 - 102) / 102 = 1.45098039
So, if I put the records in monthlyIncrease and write DUMP monthlyIncrease, I will get:
(Jan,0)
(Feb,50)
(Mar,1.45098039)
........
........
(Dec, 0.25)
Is it possible in Pig? I can't think of any way to do this.

Possible. Create a similar relation, say b. Sort both relations by month, rank both relations a and b, join on a.rank = b.rank + 1, and then do the calculations. You will have to union in the (Jan,0) record.
Assuming monthly is sorted by the group (month):
monthly = LOAD '/test.txt' USING PigStorage('\t') AS (a1:chararray, a2:int);
a = rank monthly;                        -- prepends a sequential rank as $0
b = rank monthly;
c = join a by $0, b by ($0 + 1);         -- pairs each month with the previous one
d = foreach c generate a::a1, (double)(a::a2 - b::a2) / b::a2;
e = limit monthly 1;                     -- the first month has no predecessor
f = foreach e generate a1, 0.0;
g = UNION d, f;
dump g;
Result
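For the sample monthly data above, g should come out roughly as follows (values rounded to five decimals; row order after the UNION is not guaranteed):
(Jan,0.0)
(Feb,50.0)
(Mar,1.45098)
(Apr,0.8)
(May,0.31111)
(Jun,0.33898)
(Jul,0.31646)
(Aug,0.21154)
(Sep,0.14286)
(Oct,0.22917)
(Nov,0.12994)
(Dec,0.25)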

Related

How can I replace the LAST() function in MS Access with proper ordering on a rather large table?

I have an MS Access database with two tables, Asset and Transaction. The schema looks like this:
Table ASSET
Key Date1 AType FieldB FieldC ...
A 2023.01.01 T1
B 2022.01.01 T1
C 2023.01.01 T2
.
.
TABLE TRANSACTION
Date2 Key TType1 TType2 TType3 FieldOfInterest ...
2022.05.31 A 1 1 1 10
2022.08.31 A 1 1 1 40
2022.08.31 A 1 2 1 41
2022.09.31 A 1 1 1 30
2022.07.31 A 1 1 1 30
2022.06.31 A 1 1 1 20
2022.10.31 A 1 1 1 45
2022.12.31 A 2 1 1 50
2022.11.31 A 1 2 1 47
2022.05.23 B 2 1 1 30
2022.05.01 B 1 1 1 10
2022.05.12 B 1 2 1 20
.
.
.
The ASSET table has a PK (Key).
The TRANSACTION table has a composite key that is (Key, Date2, TType1, TType2, TType3).
Given the above tables let's see an example:
Input1 = 2022.04.01
Input2 = 2022.08.31
Desired result:
Key FieldOfInterest
A 41
because if the Transactions in scope were ordered by Date2, TType1, TType2, TType3 all ascending, then the record having FieldOfInterest = 41 would be the last one.
Note that Asset B is not in scope because Asset.Date1 < Input1, nor is Asset C because AType != T1. Ultimately I am curious about the SUM(FieldOfInterest) of the last transactions of all Assets that are in scope as determined by the input variables.
The following query has so far provided the right results but after upgrading to a newer MS Access version, the LAST() operation is no longer reliably returning the row which is the latest addition to the Transaction table.
I have several input values, but the most important ones are two dates; let's call them InputDate1 and InputDate2.
This is how it worked so far:
SELECT Asset.AType, Last(FieldOfInterest) AS CurrentValue ,Asset.Key
FROM Transaction
INNER JOIN Asset ON Transaction.Key = Asset.Key
WHERE Transaction.Date2 <= InputDate2 And Asset.Date1 >= InputDate1
GROUP BY Asset.Key, Asset.AType
HAVING Asset.AType='T1'
It is known that grouped records are not guaranteed to be in any order. Obviously it is a mistake to rely on the GROUP BY operation always keeping the original table order, but let's just ignore this for now.
I have been struggling to come up with the right way to do the following:
join the Asset and Transaction tables on Asset.Key = Transaction.Key
filter by Asset.Date1 >= InputDate1 AND Transaction.Date2 <= InputDate2
then I need to select one record for each Transaction.Key where Date2, TType1, TType2 and TType3 have the highest values (this represents the actual last record for a given Key)
As far as I know there is no way to order records within a GROUP BY clause, which is unfortunate.
I have tried ranking, but the Transaction table is large (800k rows) and the performance was very slow; I need something faster than this. The following three saved queries are what I wrote and chained together, but the performance is very disappointing, probably due to the ranking step.
-- Saved query step_1
SELECT Asset.*, Transaction.*
FROM Transaction
INNER JOIN Asset ON Transaction.Key = Asset.Key
WHERE Transaction.Date2 <= 44926
AND Asset.Date1 >= 44562
AND Asset.aType = 'T1'
-- Saved query step_2
SELECT tr.FieldOfInterest, (SELECT Count(*) FROM
(SELECT tr2.Transaction.Key, tr2.Date2, tr2.Transaction.tType1, tr2.tType2, tr2.tType3 FROM step_1 AS tr2) AS tr1
WHERE (tr1.Date2 > tr.Date2 OR
(tr1.Date2 = tr.Date2 AND tr1.tType1 > tr.Transaction.tType1) OR
(tr1.Date2 = tr.Date2 AND tr1.tType1 = tr.Transaction.tType1 AND tr1.tType2 > tr.tType2) OR
(tr1.Date2 = tr.Date2 AND tr1.tType1 = tr.Transaction.tType1 AND tr1.tType2 = tr.tType2 AND tr1.tType3 > tr.tType3))
AND tr1.Key = tr.Transaction.Key)+1 AS Rank
FROM step_1 AS tr
-- Saved query step_3
SELECT SUM(FieldOfInterest) FROM step_2
WHERE Rank = 1
I hope I am being clear enough so that I can get some useful recommendations. I've been stuck with this for weeks now and really don't know what to do about it. I am open to any suggestions.
Reading the following specification:
then I need to select one record for each Transaction.Key where Date2, TType1, TType2 and TType3 have the highest values (this represents the actual last record for a given Key)
Consider a simple aggregation for step 2 to retrieve the max values, then in step 3 join all fields back to the first query.
Step 1 (rewritten to avoid name collision and too many columns)
SELECT a.[Key] AS Asset_Key, a.Date1, a.AType,
t.[Key] AS Transaction_Key, t.Date2,
t.TType1, t.TType2, t.TType3, t.FieldOfInterest
FROM Transaction t
INNER JOIN Asset a ON t.[Key] = a.[Key]
WHERE t.Date2 <= 44926
AND a.Date1 >= 44562
AND a.AType = 'T1'
Step 2
SELECT Transaction_Key,
MAX(Date2) AS Max_Date2,
MAX(TType1) AS Max_TType1,
MAX(TType2) AS Max_TType2,
MAX(TType3) AS Max_TType3
FROM step_1
GROUP BY Transaction_Key
Step 3
SELECT s1.*
FROM step_1 s1
INNER JOIN step_2 s2
ON s1.Transaction_Key = s2.Transaction_Key
AND s1.Date2 = s2.Max_Date2
AND s1.TType1 = s2.Max_TType1
AND s1.TType2 = s2.Max_TType2
AND s1.TType3 = s2.Max_TType3
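The question ultimately wants SUM(FieldOfInterest) over those per-asset last rows, so one more saved query along these lines should finish the chain (my addition, not part of the answer above; it assumes Step 3 is saved as step_3):
-- Step 4: total FieldOfInterest of the last transaction of each in-scope asset
SELECT SUM(s3.FieldOfInterest) AS TotalOfInterest
FROM step_3 AS s3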

SQL: trying to sum values up to a specific value

I have a problem where I have a large count of values on one side (a) and need to sum them up to a single value on the other side (z). There is no logical grouping to get to the total value (z).
On side (a) there are 10000+ items that need to be summed to a single value on the (z) side. Not all of the values on side (a) are needed to sum up to (z).
(a)    (z)
123    2
321    19
234    100
122
1
23
1
19
77
Expected output:
(a) 1, 1 = (z) 2
(a) 19 = (z) 19
(a) 23, 77 = (z) 100
Sum(a) to equal a value in (z)
My current code groups on date but now that will not work as I do not have a predefined date range.
Current code:
Select * From
(
Select sum(amount), date
From (a)
Group by date
) a
Inner join
(
Select amount,date
From (z)
) b on a.date = b.date
Where a.Amount - b.Amount = 0
This sounds like a self-join:
select z.z, a1.amount, a2.amount
from z z left join
(a a1 left join
a a2
on a1.amount < a2.amount
)
on z.amount = a1.amount + coalesce(a2.amount, 0);

Sum over rows split between two columns

My data looks like this and I can't figure out how to obtain the column "want". I've tried various combinations of RETAIN, LAG and SUM functions with no success, unfortunately.
month quantity1 quantity2 want
1 a x x+sum(b to l)
2 b y sum(x to y)+sum(c to l)
3 c z sum(x to z)+sum(d to l)
4 d
5 e
6 f
7 g
8 h
9 i
10 j
11 k
12 l
Thank you for any help on this matter
It is convenient to sum quantity1 first and store the value in a macro variable; each month's want is then the running total of qty2 plus the part of the qty1 total that has not yet been accumulated. Using superfluous' data example:
proc sql;
select sum(qty1) into:sum_qty1 from temp;
quit;
data want;
set temp;
value1+qty1; /* running total of qty1 */
value2+qty2; /* running total of qty2 */
want=value2+&sum_qty1-value1; /* qty2 so far + qty1 still to come */
if missing(qty2) then want=.;
drop value:;
run;
You may be able to do this in one step, but the following produces the desired result in two. The first step is to calculate the sum of the relevant quantity1 values, and the second is to add them to the sum of the relevant quantity2 values:
data temp;
input month qty1 qty2;
datalines;
1 1 100
2 1 100
3 1 100
4 1 .
5 1 .
6 1 .
7 1 .
8 1 .
9 1 .
10 1 .
11 1 .
12 1 .
;
run;
proc sql;
create table qty1_sums as select distinct
a.*, sum(b.qty1) as qty1_sums
from temp as a
left join temp as b
on a.month < b.month
group by a.month;
create table want as select distinct
a.*,
case when not missing(a.qty2) then sum(a.qty1_sums, sum(b.qty2)) end as want
from qty1_sums as a
left join temp as b
on a.month >= b.month
group by a.month;
quit;
Sounds like a 'rolling 12 months sum'. If so, it is much easier to do with a different data structure (not 2 variables, but 24 rows of 1 variable); then you have all of the ETS tools, or a simple process in either SQL or a SAS data step.
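For illustration, here is a minimal sketch of that restructured approach (the LONG table, its seq column numbering the months 1 = Jan 2014 through 24 = Dec 2015, and the qty column are my assumptions, not from the original answer); the rolling 12-month sum then becomes a plain self-join:
proc sql;
create table rolling as
select a.seq,
sum(b.qty) as rolling12 /* the 12 monthly values ending at month a.seq */
from long as a
left join long as b
on b.seq between a.seq - 11 and a.seq
group by a.seq
having a.seq >= 13; /* keep only the 2015 months, which have a full 12-month window */
quit;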
If you can't/won't restructure your data, then you can do this by loading the data into temporary arrays (or a hash table, but arrays are simpler for a novice). That gives you access to the whole thing right up front. Example:
data have;
do month = 1 to 12;
q_2014 = rand('Uniform')*10+500+month*5;
q_2015 = rand('Uniform')*10+550+month*5;
output;
end;
run;
data want;
array q2014[12] _temporary_; *temporary array to hold data;
array q2015[12] _temporary_;
if _n_=1 then do; *load data into temporary arrays;
do _n = 1 to n_data;
set have point=_n nobs=n_data;
q2014[_n] = q_2014;
q2015[_n] = q_2015;
end;
end;
set have;
do _i = 1 to _n_; *grab the this_year data;
q_rolling12 = sum(q_rolling12,q2015[_i]);
end;
do _j = _n_+1 to n_data;
q_rolling12 = sum(q_rolling12,q2014[_j]);*grab the last_year data;
end;
run;

PIG: How to create a percentage (%) based table?

I am trying to create a table that will show the number of occurrences as percentages. For example, I have a table named example that contains this data:
class, value
------ -------
1 , abc
1 , abc
1 , xyz
1 , abc
2 , xyz
2 , abc
Here, for class value 1, 'abc' occurred 3 times and 'xyz' occurred only once out of 4 total occurrences. For class value 2, 'abc' and 'xyz' each occurred once (out of two total occurrences).
So, the output is:
class, %_of_abc, %_of_xyz
------ -------- --------
1 , 75 , 25
2 , 50 , 50
Any idea how to do it when both of the column values are changing? I was thinking of doing it using GROUP, but I'm not sure how grouping by the class value would help me.
A little bit complex, but here is the solution:
grunt> Dump A;
(1,abc)
(1,abc)
(1,xyz)
(1,abc)
(2,xyz)
(2,abc)
grunt> B = Group A by class;
grunt> C = foreach B generate group as class:int, COUNT(A) as cnt;
grunt> D = Group A by (class,value);
grunt> E = foreach D generate FLATTEN(group), COUNT(A) as tot_cnt;
grunt> F = foreach E generate $0 as class:int, $1 as value:chararray, tot_cnt;
grunt> G = JOIN F BY class,C BY class;
grunt> H = foreach G generate $0 as class,$1 as value,($2*100/$4) as perc;
grunt> Dump H;
(1,xyz,25)
(1,abc,75)
(2,xyz,50)
(2,abc,50)
I = group H by class;
J = FOREACH I generate group as class, FLATTEN(BagToTuple(H.perc));
Dump J;
(1,75,25)
(2,50,50)

Aggregate only adjacent records with T-SQL

I have (simplified for the example) a table with the following data
Row Start Finish ID Amount
--- --------- ---------- -- ------
1 2008-10-01 2008-10-02 01 10
2 2008-10-02 2008-10-03 02 20
3 2008-10-03 2008-10-04 01 38
4 2008-10-04 2008-10-05 01 23
5 2008-10-05 2008-10-06 03 14
6 2008-10-06 2008-10-07 02 3
7 2008-10-07 2008-10-08 02 8
8 2008-10-08 2008-11-08 03 19
The dates represent a period in time, the ID is the state a system was in during that period and the amount is a value related to that state.
What I want to do is to aggregate the Amounts for adjacent rows with the same ID number, but keep the same overall sequence so that contiguous runs can be combined. Thus I want to end up with data like:
Row Start Finish ID Amount
--- --------- ---------- -- ------
1 2008-10-01 2008-10-02 01 10
2 2008-10-02 2008-10-03 02 20
3 2008-10-03 2008-10-05 01 61
4 2008-10-05 2008-10-06 03 14
5 2008-10-06 2008-10-08 02 11
6 2008-10-08 2008-11-08 03 19
I am after a T-SQL solution that can be put into an SP; however, I can't see how to do that with simple queries. I suspect that it may require iteration of some sort, but I don't want to go down that path.
The reason I want to do this aggregation is that the next step in the process is to do a SUM() and Count() grouped by the unique ID's that occur within the sequence, so that my final data will look something like:
ID Counts Total
-- ------ -----
01 2 71
02 2 31
03 2 33
However if I do a simple
SELECT COUNT(ID), SUM(Amount) FROM data GROUP BY ID
On the original table I get something like
ID Counts Total
-- ------ -----
01 3 71
02 3 31
03 2 33
Which is not what I want.
If you read the book "Developing Time-Oriented Database Applications in SQL" by R T Snodgrass (the pdf of which is available from his web site under publications), and get as far as Figure 6.25 on p165-166, you will find the non-trivial SQL which can be used in the current example to group the various rows with the same ID value and continuous time intervals.
The query development below is close to correct, but there is a problem spotted right at the end, which has its source in the first SELECT statement. I've not yet tracked down why the incorrect answer is being given. [If someone can test the SQL on their DBMS and tell me whether the first query works correctly there, it would be a great help!]
It looks something like:
-- Derived from Figure 6.25 from Snodgrass "Developing Time-Oriented
-- Database Applications in SQL"
CREATE TABLE Data
(
Start DATE,
Finish DATE,
ID CHAR(2),
Amount INT
);
INSERT INTO Data VALUES('2008-10-01', '2008-10-02', '01', 10);
INSERT INTO Data VALUES('2008-10-02', '2008-10-03', '02', 20);
INSERT INTO Data VALUES('2008-10-03', '2008-10-04', '01', 38);
INSERT INTO Data VALUES('2008-10-04', '2008-10-05', '01', 23);
INSERT INTO Data VALUES('2008-10-05', '2008-10-06', '03', 14);
INSERT INTO Data VALUES('2008-10-06', '2008-10-07', '02', 3);
INSERT INTO Data VALUES('2008-10-07', '2008-10-08', '02', 8);
INSERT INTO Data VALUES('2008-10-08', '2008-11-08', '03', 19);
SELECT DISTINCT F.ID, F.Start, L.Finish
FROM Data AS F, Data AS L
WHERE F.Start < L.Finish
AND F.ID = L.ID
-- There are no gaps between F.Finish and L.Start
AND NOT EXISTS (SELECT *
FROM Data AS M
WHERE M.ID = F.ID
AND F.Finish < M.Start
AND M.Start < L.Start
AND NOT EXISTS (SELECT *
FROM Data AS T1
WHERE T1.ID = F.ID
AND T1.Start < M.Start
AND M.Start <= T1.Finish))
-- Cannot be extended further
AND NOT EXISTS (SELECT *
FROM Data AS T2
WHERE T2.ID = F.ID
AND ((T2.Start < F.Start AND F.Start <= T2.Finish)
OR (T2.Start <= L.Finish AND L.Finish < T2.Finish)));
The output from that query is:
01 2008-10-01 2008-10-02
01 2008-10-03 2008-10-05
02 2008-10-02 2008-10-03
02 2008-10-06 2008-10-08
03 2008-10-05 2008-10-06
03 2008-10-05 2008-11-08
03 2008-10-08 2008-11-08
Edited: There's a problem with the penultimate row - it should not be there. And I'm not clear (yet) where it is coming from.
Now we need to treat that complex expression as a query expression in the FROM clause of another SELECT statement, which will sum the amount values for a given ID over the entries that overlap with the maximal ranges shown above.
SELECT M.ID, M.Start, M.Finish, SUM(D.Amount)
FROM Data AS D,
(SELECT DISTINCT F.ID, F.Start, L.Finish
FROM Data AS F, Data AS L
WHERE F.Start < L.Finish
AND F.ID = L.ID
-- There are no gaps between F.Finish and L.Start
AND NOT EXISTS (SELECT *
FROM Data AS M
WHERE M.ID = F.ID
AND F.Finish < M.Start
AND M.Start < L.Start
AND NOT EXISTS (SELECT *
FROM Data AS T1
WHERE T1.ID = F.ID
AND T1.Start < M.Start
AND M.Start <= T1.Finish))
-- Cannot be extended further
AND NOT EXISTS (SELECT *
FROM Data AS T2
WHERE T2.ID = F.ID
AND ((T2.Start < F.Start AND F.Start <= T2.Finish)
OR (T2.Start <= L.Finish AND L.Finish < T2.Finish)))) AS M
WHERE D.ID = M.ID
AND M.Start <= D.Start
AND M.Finish >= D.Finish
GROUP BY M.ID, M.Start, M.Finish
ORDER BY M.ID, M.Start;
This gives:
ID Start Finish Amount
01 2008-10-01 2008-10-02 10
01 2008-10-03 2008-10-05 61
02 2008-10-02 2008-10-03 20
02 2008-10-06 2008-10-08 11
03 2008-10-05 2008-10-06 14
03 2008-10-05 2008-11-08 33 -- Here be trouble!
03 2008-10-08 2008-11-08 19
Edited: This is almost the correct data set on which to do the COUNT and SUM aggregation requested by the original question, so the final answer is:
SELECT I.ID, COUNT(*) AS Number, SUM(I.Amount) AS Amount
FROM (SELECT M.ID, M.Start, M.Finish, SUM(D.Amount) AS Amount
FROM Data AS D,
(SELECT DISTINCT F.ID, F.Start, L.Finish
FROM Data AS F, Data AS L
WHERE F.Start < L.Finish
AND F.ID = L.ID
-- There are no gaps between F.Finish and L.Start
AND NOT EXISTS
(SELECT *
FROM Data AS M
WHERE M.ID = F.ID
AND F.Finish < M.Start
AND M.Start < L.Start
AND NOT EXISTS
(SELECT *
FROM Data AS T1
WHERE T1.ID = F.ID
AND T1.Start < M.Start
AND M.Start <= T1.Finish))
-- Cannot be extended further
AND NOT EXISTS
(SELECT *
FROM Data AS T2
WHERE T2.ID = F.ID
AND ((T2.Start < F.Start AND F.Start <= T2.Finish) OR
(T2.Start <= L.Finish AND L.Finish < T2.Finish)))
) AS M
WHERE D.ID = M.ID
AND M.Start <= D.Start
AND M.Finish >= D.Finish
GROUP BY M.ID, M.Start, M.Finish
) AS I
GROUP BY I.ID
ORDER BY I.ID;
id number amount
01 2 71
02 2 31
03 3 66
Review:
Oh! Drat...the entry for 3 has twice the 'amount' that it should have. Previous 'edited' parts indicate where things started to go wrong. It looks as though either the first query is subtly wrong (maybe it is intended for a different question), or the optimizer I'm working with is misbehaving. Nevertheless, there should be an answer closely related to this that will give the correct values.
For the record: tested on IBM Informix Dynamic Server 11.50 on Solaris 10. However, should work fine on any other moderately standard-conformant SQL DBMS.
You probably need to create a cursor and loop through the results, keeping track of which ID you are working with and accumulating the data along the way. When the ID changes you can insert the accumulated data into a temporary table and return the table at the end of the procedure (select all from it). A table-based function might be better, as you can then just insert into the return table as you go along.
I suspect that it may require iteration of some sort but I don't want to go down that path.
I think that's the route you'll have to take: use a cursor to populate a table variable. If you have a large number of records you could use a permanent table to store the results; then, when you need to retrieve the data, you could process only the new data.
I would add a bit field with a default of 0 to the source table to keep track of which records have been processed. Assuming no one is using select * on the table, adding a column with a default value won't affect the rest of your application.
Add a comment to this post if you want help coding the solution.
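For concreteness, here is a minimal sketch of the cursor-into-table-variable idea described above, assuming the Data table created earlier in this thread; the @Result table, cursor and variable names are illustrative, not taken from any of the answers:
DECLARE @Result TABLE (ID CHAR(2), Start DATE, Finish DATE, Amount INT);
DECLARE @ID CHAR(2), @Start DATE, @Finish DATE, @Amount INT;
DECLARE @RunID CHAR(2), @RunStart DATE, @RunFinish DATE, @RunAmount INT;
DECLARE RowCursor CURSOR LOCAL FAST_FORWARD FOR
SELECT ID, Start, Finish, Amount FROM Data ORDER BY Start;
OPEN RowCursor;
FETCH NEXT FROM RowCursor INTO @ID, @Start, @Finish, @Amount;
WHILE @@FETCH_STATUS = 0
BEGIN
IF @RunID IS NULL OR @ID <> @RunID
BEGIN
-- ID changed: flush the accumulated run and start a new one
IF @RunID IS NOT NULL
INSERT INTO @Result VALUES (@RunID, @RunStart, @RunFinish, @RunAmount);
SELECT @RunID = @ID, @RunStart = @Start, @RunFinish = @Finish, @RunAmount = @Amount;
END
ELSE
BEGIN
-- same ID as the previous row: extend the current run
SELECT @RunFinish = @Finish, @RunAmount = @RunAmount + @Amount;
END
FETCH NEXT FROM RowCursor INTO @ID, @Start, @Finish, @Amount;
END
IF @RunID IS NOT NULL
INSERT INTO @Result VALUES (@RunID, @RunStart, @RunFinish, @RunAmount);
CLOSE RowCursor;
DEALLOCATE RowCursor;
-- the per-ID counts and totals asked for in the question
SELECT ID, COUNT(*) AS Counts, SUM(Amount) AS Total FROM @Result GROUP BY ID;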
Well, I decided to go down the iteration route using a mixture of joins and cursors. By JOINing the data table against itself I can create a linked list of only those records that are consecutive.
INSERT INTO #CONSEC
SELECT a.ID, a.Start, b.Finish, b.Amount
FROM Data a JOIN Data b
ON (a.Finish = b.Start) AND (a.ID = b.ID)
Then I can unwind the list by iterating over it with a cursor, doing updates back to the Data table to adjust the amounts (and deleting the now-extraneous records from the Data table):
DECLARE @ID CHAR(2), @Start DATE, @Finish DATE, @Amount INT -- variable types assumed to match the Data table above
DECLARE @ID_Last CHAR(2), @Start_Last DATE, @Finish_Last DATE, @Total INT
DECLARE CCursor CURSOR FOR
SELECT ID, Start, Finish, Amount FROM #CONSEC ORDER BY Start DESC
SET @Total = 0
OPEN CCursor
FETCH NEXT FROM CCursor INTO @ID, @Start, @Finish, @Amount
WHILE @@FETCH_STATUS = 0
BEGIN
SET @Total = @Total + @Amount
SET @Start_Last = @Start
SET @Finish_Last = @Finish
SET @ID_Last = @ID
DELETE FROM Data WHERE Start = @Finish
FETCH NEXT FROM CCursor INTO @ID, @Start, @Finish, @Amount
IF (@ID_Last <> @ID) OR (@Finish <> @Start_Last)
BEGIN
UPDATE Data
SET Amount = Amount + @Total
WHERE Start = @Start_Last
SET @Total = 0
END
END
CLOSE CCursor
DEALLOCATE CCursor
This all works and has acceptable performance for typical data that I am using.
I did find one small issue with the above code. Originally I was updating the Data table on each loop through the cursor, but this didn't work. It seems that you can only do one update on a record, and that multiple updates (in order to keep adding data) revert back to reading the original contents of the record.