How to do 'Summarizing' in Pig Latin?

I'm trying to do a summarize operation with Pig.
For example, I have a table called t3:
product price country
A 5 Italy
B 4 USA
C 12 France
A 5 Italy
B 7 Russia
I need to do a summarize operation using two keys: product and country.
I concatenate product and country into a single key.
Where the CONCAT result repeats, the price values should be summed.
Where the CONCAT result does not repeat, the price stays the same as in table t3.
The expected output would be:
CONCAT Price_1
AItaly 10
BUSA 4
CFrance 12
BRussia 7
In Pig I wrote the following script (the code is wrong, but it shows the idea):
t3 = LOAD '/home/Desktop/3_table/3_table.data' AS (product:chararray, price:int, country:chararray);
c1 = FOREACH t3 GENERATE CONCAT(product, country);
c2 = FOREACH t3 GENERATE *, c1;
product_1 = GROUP c2 BY c1;
price_1 = FOREACH product_1 GENERATE group, SUM(product_1.price);
STORE price_1 INTO 'summarise_by_2_ID' USING PigStorage('\t');
Maybe someone can explain how to reach the expected result?
Thanks a lot in advance!

If you want to calculate the sum per product and country you do not need to use the concat function. Just group by those two fields.
A = LOAD 's.txt' USING PigStorage('\t') AS (product:chararray, price:int, country:chararray);
B = GROUP A BY (product, country);
C = FOREACH B GENERATE CONCAT(group.product,group.country), SUM(A.price);
Strictly speaking, the CONCAT is not necessary here; it only formats the output as expected.
DUMP C;
(AItaly,10)
(BUSA,4)
(BRussia,7)
(CFrance,12)
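For readers who want to check the grouping logic outside Pig, here is a minimal plain-Python sketch of the same idea: group on the (product, country) pair, sum the prices, and concatenate the key only for display. The function name sum_by_keys and the in-memory list t3 are my own illustration, not part of the original script.

```python
from collections import defaultdict


def sum_by_keys(rows):
    """Sum price per (product, country) pair, then label each group."""
    totals = defaultdict(int)
    for product, price, country in rows:
        # Group on the pair itself, not on the concatenated string.
        totals[(product, country)] += price
    # Concatenate only for display, mirroring CONCAT(group.product, group.country).
    return [(product + country, total) for (product, country), total in totals.items()]


# Sample rows copied from table t3 in the question.
t3 = [
    ("A", 5, "Italy"),
    ("B", 4, "USA"),
    ("C", 12, "France"),
    ("A", 5, "Italy"),
    ("B", 7, "Russia"),
]
result = sum_by_keys(t3)
print(result)
```

Grouping on the pair rather than on the concatenated string also avoids merging keys that happen to concatenate to the same text, for example ("AB", "C") and ("A", "BC").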

Related

Combine data in multiple rows in Oracle

I have Oracle 12c, so please answer using Oracle syntax. I want to combine data from multiple rows into one row; please see the expected result for an example. I tried the PIVOT function, but it did not work for me because I want to pivot Call_day from the previous row and the latest row, with the list of columns shown in "Expected result" below. Thank you for your help.
Data in the table:
Acct_num Call_day Call_code Start_day_To_Call
1 04/23/2018 AA 04/02/2018
1 04/24/2018 NULL 04/02/2018
1 04/25/2018 CC 04/02/2018
2 04/26/2018 ZZ 05/02/2018
2 04/27/2018 CC 05/02/2018
If multiple calls were made within a Start_day_To_Call date, I want the data for the two latest calls pivoted as shown below:
Expected result:
Acct_num Call_day1 Call_day2 Call_code1 Call_code2 Start_day_To_Call
1 04/24/2018 04/25/2018 NULL CC 04/02/2018
2 04/26/2018 04/27/2018 ZZ CC 05/02/2018
If you only want the last two days, you can use the query below.
It first gets the last call for each acct_num, then finds the previous call, and then fills in the call codes for both. If performance is a concern, you could use an id column instead.
select p.acct_num,
       p.prev_last_day,
       (select z.call_code
          from test_tbl z
         where z.acct_num = p.acct_num
           and z.call_day = p.prev_last_day) prev_call_code,
       p.last_day,
       (select z.call_code
          from test_tbl z
         where z.acct_num = p.acct_num
           and z.call_day = p.last_day) last_call_code,
       p.start_day_to_call
  from (select x.acct_num,
               max(x.call_day) last_day,
               max((select max(y.call_day)
                      from test_tbl y
                     where y.acct_num = x.acct_num
                       and y.call_day < x.call_day)) prev_last_day,
               min(x.start_day_to_call) start_day_to_call
          from test_tbl x
         group by x.acct_num) p
 order by p.acct_num
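The steps described above (find the last call per account, find the one before it, then pick up both call codes) can also be sketched in plain Python to check the expected output. This is only an illustration on in-memory rows; the function name last_two_calls and the use of None for NULL are my assumptions, not Oracle behaviour.

```python
from datetime import date


def last_two_calls(rows):
    """rows: (acct_num, call_day, call_code, start_day) tuples.

    Returns one tuple per account:
    (acct_num, call_day1, call_day2, call_code1, call_code2, start_day).
    """
    by_acct = {}
    for acct, call_day, call_code, start_day in rows:
        by_acct.setdefault(acct, []).append((call_day, call_code, start_day))

    out = []
    for acct in sorted(by_acct):
        calls = sorted(by_acct[acct], key=lambda c: c[0])  # oldest first
        last_day, last_code, _ = calls[-1]
        if len(calls) > 1:
            prev_day, prev_code, _ = calls[-2]
        else:
            prev_day, prev_code = None, None  # only one call for this account
        start = min(c[2] for c in calls)
        out.append((acct, prev_day, last_day, prev_code, last_code, start))
    return out


# Sample data copied from the question; None stands in for NULL.
rows = [
    (1, date(2018, 4, 23), "AA", date(2018, 4, 2)),
    (1, date(2018, 4, 24), None, date(2018, 4, 2)),
    (1, date(2018, 4, 25), "CC", date(2018, 4, 2)),
    (2, date(2018, 4, 26), "ZZ", date(2018, 5, 2)),
    (2, date(2018, 4, 27), "CC", date(2018, 5, 2)),
]
pivoted = last_two_calls(rows)
for row in pivoted:
    print(row)
```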

Using Exclude/Intersect in the same Table

I have this table:
Line Group
1 A
2 A1
3 A2
4 A2
5 ALA
I want to write a query that selects only the lines whose group is A, A1 or A2, but not the one named ALA.
I tried a query like this:
select linea from table where group like '%A%'
Except
select linea from table where group ='ALA'
But it didn't work. What can I do to get the data I need?
Thanks in advance.

How to convert a tuple to a string in Pig?

I have data as
id company
1 (a,b)
2 (a,c)
3 (f,g,h)
company is a tuple; I generate it with BagToTuple(sortedbag.company) AS company.
I would like to remove the tuple formatting so that the data looks like the following:
id company
1 a b
2 a c
3 f g h
The company column should have no brackets, with values separated by spaces. Thanks.
===================update
I have the data set as
id company
1 a
1 b
1 a
2 c
2 a
I wrote the code as following:
record = load....
grp = GROUP record BY id;
newdata = FOREACH grp GENERATE group AS id,
COUNT(record) AS counts,
BagToTuple(record.company) AS company;
The output looks like:
id count company
1 3 (a,b,a)
2 2 (c,a)
But I would like company to be sorted and distinct, with no brackets, and separated by spaces.
The result I expect is as follows:
id count company
1 3 a b
2 2 a c
I think you can just replace BagToTuple with BagToString in the last step:
newdata2 = FOREACH grp
GENERATE group AS id, COUNT(record) as counts,
BagToString(record.company, ' ') as company:chararray;
STORE newdata2 INTO 'outdir2' USING PigStorage('#');
After the script runs:
$ cat outdir2/part-r-00000
1#3#a b a
2#2#a c
For a general tuple-to-string conversion without a UDF, you can use BagToString(TOBAG(your_tuple)).
You can use the built-in FLATTEN operator: http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Flatten+Operator
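To make the target of the update concrete (count every record, but show company values distinct, sorted and space-separated), here is a small plain-Python sketch of that transformation. It is not Pig code; the function name summarize_companies is hypothetical and the records are copied from the question.

```python
def summarize_companies(records):
    """records: iterable of (id, company) pairs.

    Returns (id, count, companies) per id, where count covers all records
    and companies is the distinct, sorted, space-separated company list.
    """
    groups = {}
    for rec_id, company in records:
        groups.setdefault(rec_id, []).append(company)
    return [
        (rec_id, len(companies), " ".join(sorted(set(companies))))
        for rec_id, companies in sorted(groups.items())
    ]


records = [(1, "a"), (1, "b"), (1, "a"), (2, "c"), (2, "a")]
summary = summarize_companies(records)
for row in summary:
    print(*row)
```

The count is taken before de-duplication, which is what the expected output in the question shows (id 1 has count 3 but only two distinct companies).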

kdb: Add a column showing sum of rows in a table with dynamic headers while ignoring nulls

I have a table whose columns are dynamic, except for one column, A. The table also has some null values (0n) in it. How do I add another column that shows the total of each row and either ignores the columns that are 0n in that row or treats them as 0?
Here is my code; it fails on sum and also does not ignore nulls.
addTotalCol:{[]
table:flip`A`B`C`D!4 4#til 9;
colsToSum: string (cols table) except `A; / don't sum A
table: update Total: sum (colsToSum) from table; / type error here. Also check for nulls
:table;
}
I think it is better to use functional update in your case:
addTotalCol:{[]
table:flip`A`B`C`D!4 4#til 9;
colsToSum:cols[table] except `A; / don't sum A
table:![table;();0b;enlist[`Total]!enlist(sum;enlist,colsToSum)];
:table;
}
The reason it is not working is that your fourth line is parsed as:
table: update Total: sum (enlist"B";enlist"C";enlist"D") from table;
Since sum only works on numbers and your inputs are strings, it returns a 'type error.
Another solution, keeping colsToSum as string input:
addTotalCol:{[]
table:flip`A`B`C`D!4 4#til 9;
colsToSum:string cols[table] except `A; / don't sum A
table:get"update Total:sum(",sv[";";colsToSum],") from table";
:table;
}
Basically, this builds the query as a string before it is executed in q. Note that get evaluates the string in the global context, so table has to be visible there.
Functional update is still the preferred approach.
EDIT: Full answer to sum 0n:
addTotalCol:{[]
table:flip`A`B`C`D!4 4#0n,til 9;
colsToSum:cols[table] except `A; / don't sum A
table:![table;();0b;enlist[`Total]!enlist(sum;(^;0;enlist,colsToSum))];
:table;
}
I think there is a cleaner version here without a functional form.
q)//let us build a table where our first col is symbols and the rest are numerics,
/// we will exclude first from row sums
q)t:flip `c0`c1`c2`c3!(`a`b`c`d;1 2 3 0N;0n 4 5 6f;1 2 3 0Nh)
q)//columns for sum
q)sc:cols[t] except `c0
q)///now let us make sure we fill in each column with zero,
/// add across rows and append as a new column
q)show t1:t,'flip enlist[`sumRows]!enlist sum each flip 0^t sc
c0 c1 c2 c3 sumRows
-------------------
a  1     1  2
b  2  4  2  8
c  3  5  3  11
d     6     6
q)meta t1
c      | t f a
-------| -----
c0     | s
c1     | i
c2     | f
c3     | h
sumRows| f
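For comparison, the same "fill nulls with zero, then add across each row" idea can be written as a short plain-Python sketch. This is not q; it assumes the table is a dict of column name to list, that None marks a null, and the function name add_row_totals is made up for the example.

```python
def add_row_totals(table, skip=("c0",), total_name="sumRows"):
    """table: dict mapping column name -> list of values (None marks a null).

    Returns a new dict with an extra column holding each row's total,
    treating nulls as 0 and leaving out the columns listed in skip.
    """
    sum_cols = [name for name in table if name not in skip]
    n_rows = len(next(iter(table.values())))
    totals = []
    for i in range(n_rows):
        values = (table[name][i] for name in sum_cols)
        totals.append(sum(0 if v is None else v for v in values))
    return {**table, total_name: totals}


# Same shape as the q table t above: first column is excluded from the sum.
t = {
    "c0": ["a", "b", "c", "d"],
    "c1": [1, 2, 3, None],
    "c2": [None, 4.0, 5.0, 6.0],
    "c3": [1, 2, 3, None],
}
t1 = add_row_totals(t)
print(t1["sumRows"])
```

Because the column list is computed from the table itself, the sketch does not depend on how many columns there are, which is the point of the dynamic-headers requirement.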

Join two file data in pig to the output in a desired format

File 1 has the data:
Name ID
-------
Mark 1
Gary 2
Robert 3
File 2 has the data:
ID result
----------
1 success
2 Fail
3 success
I loaded the data into two relations, a and b. Now I want to join them on ID, keeping only the rows whose result is success. I am able to join, but the output is not in the format I want.
a = load '/file1' as (Name:chararray,ID:int);
b = load '/file2' as (ID:int,result:chararray);
c = join a by ID, b by ID;
When I dump c, I get output in the format (name,id,id,result). How do I join a and b so that the output is in the format (name,id,result)?
You can't. What you have to do is project the fields that you want to keep using a FOREACH. You can do something like this:
D = FOREACH c GENERATE a::Name AS Name, a::ID AS ID, b::result AS result;
You can filter b before joining.
a = load '/file1' as (Name:chararray,ID:int);
b = load '/file2' as (ID:int,result:chararray);
z = FILTER b BY result == 'success';
Then join a and z.
c = join a by ID, z by ID;
Afterwards, project the fields you want with a FOREACH, as mentioned by #m2ert in the previous answer.
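Putting the two answers together (filter first, join on ID, then project only the fields you need), the end result can be checked with a small plain-Python sketch. It is not Pig; the function name join_successes is hypothetical, and it assumes each ID appears at most once in file 2.

```python
def join_successes(names, results):
    """names: (name, id) pairs; results: (id, result) pairs.

    Returns (name, id, result) for ids whose result is 'success'.
    """
    # FILTER: keep only the successful rows, indexed by id.
    wanted = {rid: res for rid, res in results if res == "success"}
    # JOIN on id, then project just the three fields we want.
    return [(name, nid, wanted[nid]) for name, nid in names if nid in wanted]


file1 = [("Mark", 1), ("Gary", 2), ("Robert", 3)]
file2 = [(1, "success"), (2, "Fail"), (3, "success")]
joined = join_successes(file1, file2)
for row in joined:
    print(row)
```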