PIG: How to create percentage (%) based table? - apache-pig

I am trying to create a table that shows the number of occurrences as a percentage. For example, I have a table named example that contains this data:
class, value
------ -------
1 , abc
1 , abc
1 , xyz
1 , abc
2 , xyz
2 , abc
Here, for class value 1, 'abc' occurred 3 times and 'xyz' occurred only once, out of a total of 4 occurrences. For class value 2, 'abc' and 'xyz' each occurred once (out of a total of two occurrences).
So, the output is:
class, %_of_abc, %_of_xyz
------ -------- --------
1 , 75 , 25
2 , 50 , 50
Any idea how to do this when both column values are changing? I was thinking of doing it with GROUP, but I am not sure how grouping by the class value would help.

It is a little bit complex, but here is the solution:
grunt> Dump A;
(1,abc)
(1,abc)
(1,xyz)
(1,abc)
(2,xyz)
(2,abc)
grunt> B = Group A by class;
grunt> C = foreach B generate group as class:int, COUNT(A) as cnt;
grunt> D = Group A by (class,value);
grunt> E = foreach D generate FLATTEN(group), COUNT(A) as tot_cnt;
grunt> F = foreach E generate $0 as class:int, $1 as value:chararray, tot_cnt;
grunt> G = JOIN F BY class, C BY class;
grunt> H = foreach G generate $0 as class,$1 as value,($2*100/$4) as perc;
grunt> Dump H;
(1,xyz,25)
(1,abc,75)
(2,xyz,50)
(2,abc,50)
I = Group H by class;
J = FOREACH I generate group as class, FLATTEN(BagToTuple(H.perc));
Dump J;
(1,75,25)
(2,50,50)
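For readers who want to sanity-check the numbers outside of Pig, here is a small plain-Python sketch of the same two-level counting (per-class totals joined with per-(class, value) counts). The function and variable names are mine, not from the Pig script:

```python
from collections import Counter, defaultdict

# In-memory version of relation A from the question.
rows = [(1, "abc"), (1, "abc"), (1, "xyz"), (1, "abc"), (2, "xyz"), (2, "abc")]

def class_percentages(rows):
    """Return {class: {value: percentage}} -- the same numbers the
    GROUP/JOIN pipeline above produces."""
    totals = Counter(cls for cls, _ in rows)    # COUNT per class (relation C)
    pair_counts = Counter(rows)                 # COUNT per (class, value) (relation E)
    result = defaultdict(dict)
    for (cls, val), cnt in pair_counts.items():
        result[cls][val] = cnt * 100 // totals[cls]   # integer %, like ($2*100/$4)
    return dict(result)

print(class_percentages(rows))
```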


Finding a pair of rows in SQL

I am not sure how to define the problem statement, but let's say below is a table History and I want to find the rows that form a pair.
A pair is defined like this: columns a and b will have the same value, c should be False, and d should be different for the two rows.
If I were using Java, I would set column c of row 3 to true when I hit a pair, or I would save both row 1 and row 3 into a separate list so that row 2 could be excluded. But I don't know how to do the same thing in SQL.
Table - History
row  a  b   c(Boolean)  d
1    1  bb  F           d
2    1  bb  F           d
3    1  bb  F           c
Query?
Result - rows 1 and 3.
Assuming the table is called test:
SELECT *
FROM test
WHERE id IN (
    SELECT MIN(id)
    FROM test
    WHERE !c
      AND a = b
      AND d != a
    GROUP BY a, d
)
We get the smallest id of every row matching your conditions. Furthermore, we group the results by a, d, which means we get only unique pairs of "a and d". Then we use these ids to select the rows we want.
Working example.
Update: without existing id
# add PK afterwards
ALTER TABLE test ADD COLUMN id INT PRIMARY KEY AUTO_INCREMENT FIRST;
Working example.
All the rows match the condition you specified. A "pair" happens when:
column a and b will have same value, and
c should have False, and
d should be different for both rows.
Rows 1 and 3 will match that, as well as 2 and 3. Also, 3 and 1 will match, as well as 3 and 2. There are four solutions.
You don't say which database, so I'll assume PostgreSQL. The query that can search using your criteria is:
select *
from t x
where exists (
select null from t y
where y.a = x.a
and y.b = x.b
and not y.c
and y.d <> x.d
);
Result:
a b c d
-- --- ------ -
1 bb false d
1 bb false d
1 bb false c
That is... the whole table.
See running example at DB Fiddle.
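To see why the whole table comes back, here is the correlated EXISTS predicate rendered as plain Python over a list of (a, b, c, d) tuples (a hypothetical stand-in for table t; the helper name is mine):

```python
# The three rows from the question's History table.
rows = [(1, "bb", False, "d"), (1, "bb", False, "d"), (1, "bb", False, "c")]

def paired_rows(rows):
    """Keep row x if some row y shares a and b, has c = False, and
    differs in d -- the same predicate as the correlated subquery."""
    return [x for x in rows
            if any(y[0] == x[0] and y[1] == x[1]
                   and not y[2] and y[3] != x[3]
                   for y in rows)]

print(paired_rows(rows))
```

Every row finds a partner with a different d (rows 1 and 2 pair with row 3, and row 3 pairs with either of them), so all three rows survive.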

how to calculate a time series in pig

Let's say if I write DUMP monthly, I get:
(Jan,2)
(Feb,102)
(Mar,250)
(Apr,450)
(May,590)
(Jun,790)
(Jul,1040)
(Aug,1260)
(Sep,1440)
(Oct,1770)
(Nov,2000)
(Dec,2500)
Checking schema:
DESCRIBE monthly;
Output:
monthly: {group: chararray,total_case: long}
I need to calculate increase rate for each month. So, for February, it will be:
(total_case in Feb - total_case in Jan) / total_case in Jan = (102 - 2) / 2 = 50
For March it will be: (250 - 102) / 102 = 1.45098039
So, if I put the records in monthlyIncrease, by writing DUMP monthlyIncrease, I will get:
(Jan,0)
(Feb,50)
(Mar,1.45098039)
........
........
(Dec, 0.25)
Is it possible in Pig? I can't think of any way to do this.
Possible. Create a similar relation, say b. Sort both relations by month, rank both relations a and b, join on a.rank = b.rank + 1, and then do the calculation. You will have to UNION in the (Jan,0) record.
Assuming monthly is sorted by the group(month)
monthly = LOAD '/test.txt' USING PigStorage('\t') as (a1:chararray,a2:int);
a = rank monthly;
b = rank monthly;
c = join a by $0, b by ($0 + 1);
d = foreach c generate a::a1,(double)((a::a2 - b::a2)*1.0/(b::a2)*1.0);
e = limit monthly 1;
f = foreach e generate $0, 0.0;
g = UNION d,f;
dump g;
Result
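The rank/join/union pipeline can be sanity-checked with a small plain-Python sketch (the helper name is mine; only the first four months of the sample data are used):

```python
# Stand-in for the rank/self-join: compute
# (this month - previous month) / previous month directly.
monthly = [("Jan", 2), ("Feb", 102), ("Mar", 250), ("Apr", 450)]

def increase_rates(monthly):
    """The first month gets 0, as in the UNION step above."""
    out = [(monthly[0][0], 0.0)]
    for (_, prev), (name, cur) in zip(monthly, monthly[1:]):
        out.append((name, (cur - prev) / prev))
    return out

print(increase_rates(monthly))
```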

hive sql subset based on first value and unique group

I have the following table sorted in a specific manner (in HiveSQL):
ID Binary UnnecessaryVar
1 F a
1 F b
1 T c
1 F d
2 F e
2 T f
2 F g
I would like to select all rows FOR EACH ID before the first T in Binary variable, including the record where the variable is T. The result of the solution applied to the table above would be:
ID Binary UnnecessaryVar
1 F a
1 F b
1 T c
2 F e
2 T f
Thank you in advance
SQL tables represent unordered sets. There is no "ordering" without a column to specify it. If you have an order by clause, you can easily add such an ordering column:
select . . . ,
row_number() over (order by <keys used in order by>) as seqnum
. . .
So let me assume you have such a column. Here is a pretty simple method:
select q.*
from (select q.*,
min(case when binary = 'T' then seqnum end) over
(partition by id) as seqnum_t
from <your query here> q
) q
where seqnum <= seqnum_t or seqnum_t is null;
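Assuming the input is already ordered within each ID, the same keep-until-first-T logic can be sketched in plain Python (hypothetical helper name, data from the question):

```python
from itertools import groupby

# Rows as (ID, Binary, UnnecessaryVar), grouped contiguously by ID.
rows = [(1, "F", "a"), (1, "F", "b"), (1, "T", "c"), (1, "F", "d"),
        (2, "F", "e"), (2, "T", "f"), (2, "F", "g")]

def until_first_t(rows):
    """For each ID, keep rows up to and including the first Binary == 'T'."""
    out = []
    for _, grp in groupby(rows, key=lambda r: r[0]):
        for row in grp:
            out.append(row)
            if row[1] == "T":   # stop after the first T in this ID
                break
    return out

print(until_first_t(rows))
```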

sum over rows split between two columns

My data looks like this, and I can't figure out how to obtain the column "want". I've tried various combinations of the retain, lag, and sum functions with no success, unfortunately.
month quantity1 quantity2 want
1 a x x+sum(b to l)
2 b y sum(x to y)+sum(c to l)
3 c z sum(x to z)+sum(d to l)
4 d
5 e
6 f
7 g
8 h
9 i
10 j
11 k
12 l
Thank you for any help on this matter
It is convenient to sum quantity1 first and then store the value in a macro variable. Using superfluous' data example:
proc sql;
select sum(qty1) into:sum_qty1 from temp;
quit;
data want;
set temp;
value1+qty1;
value2+qty2;
want=value2+&sum_qty1-value1;
if missing(qty2) then want=.;
drop value:;
run;
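The arithmetic here (want = running qty2 + total qty1 − running qty1) can be checked with a small Python sketch on hypothetical data shaped like the question: qty1 = 1 for all 12 months, qty2 = 100 for the first three months, missing after that.

```python
# Stand-in data: qty1 for all 12 months, qty2 only for the first three.
qty1 = [1] * 12
qty2 = [100, 100, 100] + [None] * 9

def want(qty1, qty2):
    """want[i] = cumulative qty2 through month i
               + (total qty1 - cumulative qty1 through month i);
    missing qty2 gives missing want, as in the SAS step."""
    total1 = sum(qty1)
    out, run1, run2 = [], 0, 0
    for q1, q2 in zip(qty1, qty2):
        run1 += q1
        if q2 is None:
            out.append(None)
            continue
        run2 += q2
        out.append(run2 + total1 - run1)
    return out

print(want(qty1, qty2))
```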
You may be able to do this in one step, but the following produces the desired result in two. The first step is to calculate the sum of the relevant quantity1 values, and the second is to add them to the sum of the relevant quantity2 values:
data temp;
input month qty1 qty2;
datalines;
1 1 100
2 1 100
3 1 100
4 1 .
5 1 .
6 1 .
7 1 .
8 1 .
9 1 .
10 1 .
11 1 .
12 1 .
;
run;
proc sql;
create table qty1_sums as select distinct
a.*, sum(b.qty1) as qty1_sums
from temp as a
left join temp as b
on a.month < b.month
group by a.month;
create table want as select distinct
a.*,
case when not missing(a.qty2) then sum(a.qty1_sums, sum(b.qty2)) end as want
from qty1_sums as a
left join temp as b
on a.month >= b.month
group by a.month;
quit;
Sounds like a 'rolling 12 month sum'. If so, it is much easier to do with a different data structure (not 2 variables, but 24 rows of 1 variable); then you have all of the ETS tools, or a simple process in either SQL or a SAS data step.
If you can't/won't restructure your data, then you can do this by loading the data into temporary arrays (or a hash table, but arrays are simpler for a novice). That gives you access to the whole thing right up front. Example:
data have;
do month = 1 to 12;
q_2014 = rand('Uniform')*10+500+month*5;
q_2015 = rand('Uniform')*10+550+month*5;
output;
end;
run;
data want;
array q2014[12] _temporary_; *temporary array to hold data;
array q2015[12] _temporary_;
if _n_=1 then do; *load data into temporary arrays;
do _n = 1 to n_data;
set have point=_n nobs=n_data;
q2014[_n] = q_2014;
q2015[_n] = q_2015;
end;
end;
set have;
do _i = 1 to _n_; *grab the this_year data;
q_rolling12 = sum(q_rolling12,q2015[_i]);
end;
do _j = _n_+1 to n_data;
q_rolling12 = sum(q_rolling12,q2014[_j]);*grab the last_year data;
end;
run;
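The two array loops amount to a prefix sum over this year plus a suffix sum over last year. A minimal Python sketch of the same rolling window, with made-up deterministic data in place of rand('Uniform'):

```python
# Stand-in data: one value per month for each year.
q_2014 = [500 + 5 * m for m in range(1, 13)]
q_2015 = [550 + 5 * m for m in range(1, 13)]

def rolling12(q_prev, q_cur):
    """For month m, sum this year's months 1..m plus
    last year's months m+1..12 -- the same window as the SAS loops."""
    return [sum(q_cur[:m]) + sum(q_prev[m:]) for m in range(1, 13)]

print(rolling12(q_2014, q_2015))
```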

Using a set in the place of a table (or another elegant solution)

I answered a question where I had to generate a temporary derived table on the fly (or use an actual table), see: https://stackoverflow.com/a/24890815/1688441 .
Instead of using the following derived table (using select and union):
(SELECT 21 AS id UNION SELECT 22) AS tmp
within:
SELECT GROUP_CONCAT(CASE WHEN COLUMN1 IS NULL THEN "NULL" ELSE COLUMN1 END)
FROM archive
RIGHT OUTER JOIN
(SELECT 21 AS id UNION SELECT 22) AS tmp ON tmp.id=archive.column2;
I would much prefer to be able to use something much more elegant such as:
([[21],[22]]) AS tmp
Is there any such notation in any of the SQL databases, or any similar feature? Is there an easy way to use a set in place of a table in FROM (by a set I mean a list of values in one dimension), the way we use one with IN?
So, using such a notation, a temporary table with one int column and one string column, having two rows, would be:
([[21,'text here'],[22,'text here2']]) AS tmp
SQL Server allows this syntax:
SELECT A, B, C,
CASE WHEN D < 21 THEN ' 0-20'
WHEN D < 51 THEN '21-50'
WHEN D < 101 THEN '51-100'
ELSE '>101' END AS E
,COUNT(*) as "Count"
FROM (
values ('CAR', 1,2,22)
,('CAR', 1,2,23)
,('BIKE',1,3,2)
)TABLE_X(A,B,C,D)
GROUP BY A, B, C,
CASE WHEN D < 21 THEN ' 0-20'
WHEN D < 51 THEN '21-50'
WHEN D < 101 THEN '51-100'
ELSE '>101' END
yielding this:
A B C E Count
---- ----------- ----------- ------ -----------
BIKE 1 3 0-20 1
CAR 1 2 21-50 2
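For what it's worth, SQLite also accepts a bare VALUES list as a derived table, which is close to the notation asked about; the columns are auto-named column1, column2, and so on. A quick check from Python:

```python
import sqlite3

# An in-memory database is enough to exercise the VALUES-as-table syntax.
con = sqlite3.connect(":memory:")
rows = con.execute(
    "SELECT * FROM (VALUES (21, 'text here'), (22, 'text here2')) AS tmp"
).fetchall()
print(rows)
con.close()
```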