How to GROUP BY multiple columns in a Pig script - apache-pig

What is the Pig equivalent of the following SQL query?
SELECT fld1, fld2, fld3, SUM(fld4)
FROM Table1
GROUP BY fld1, fld2, fld3;
For Table1:
A B C 2 X Y Z
A B C 3 X Y Z
A B D 2 X Y Z
A C D 2 X Y Z
A C D 2 X Y Z
A C D 2 X Y Z
OUTPUT:
A B C 5
A B D 2
A C D 6

See https://pig.apache.org/docs/r0.11.1/basic.html#GROUP, which includes a multi-column group example.
For your use case, the code below should suffice (it assumes a comma-delimited input.csv):
A = load 'input.csv' using PigStorage(',') AS (fld1:chararray,fld2:chararray,fld3:chararray,fld4:long,fld5:chararray,fld6:chararray,fld7:chararray);
B = FOREACH(GROUP A BY (fld1,fld2,fld3)) GENERATE FLATTEN(group) AS (fld1,fld2,fld3), SUM(A.fld4) AS fld4_aggr;
DUMP B;
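To sanity-check the expected output, here is a minimal sketch that runs the SQL query from the question against an in-memory SQLite database using Python's standard sqlite3 module. It is not Pig; the trailing X Y Z columns are dropped for brevity, and the ORDER BY clause is added only to make the result order deterministic.

```python
import sqlite3

# Sample rows from the question (fld1..fld4 only; the X Y Z columns are omitted).
ROWS = [
    ("A", "B", "C", 2), ("A", "B", "C", 3), ("A", "B", "D", 2),
    ("A", "C", "D", 2), ("A", "C", "D", 2), ("A", "C", "D", 2),
]

def group_sum(rows):
    """Sum fld4 for every distinct (fld1, fld2, fld3) combination."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Table1 (fld1 TEXT, fld2 TEXT, fld3 TEXT, fld4 INTEGER)"
    )
    conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT fld1, fld2, fld3, SUM(fld4) FROM Table1 "
        "GROUP BY fld1, fld2, fld3 ORDER BY fld1, fld2, fld3"
    ).fetchall()
    conn.close()
    return result

for row in group_sum(ROWS):
    print(*row)
```

The Pig script's GROUP A BY (fld1,fld2,fld3) plays the role of the GROUP BY clause here, and SUM(A.fld4) plays the role of SUM(fld4).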

Related

HOW TO PRINT A TO Z ALPHABETS IN QUERY WITHOUT USING TABLE

For Oracle, here's one option:
SQL> select chr(level + 64) letter
2 from dual
3 connect by level <= ascii('Z') - ascii('A') + 1;
LETTER
----------
A
B
C
D
E
F
<snip>
X
Y
Z
26 rows selected.
SQL>
For Postgres:
select chr(code)
from generate_series(ascii('A'), ascii('Z')) as t(code)
order by code;
Or, more briefly:
select chr(generate_series(65, 90));
output
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
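The same idea (generate the code points, then convert each to a character) can be sketched for SQLite as well, here driven from Python's standard sqlite3 module. This is an illustration for a database not mentioned in the answers above; it assumes a SQLite build that supports recursive CTEs and the unicode() and char() functions.

```python
import sqlite3

def alphabet():
    """Return the letters A..Z produced by a table-less recursive query."""
    conn = sqlite3.connect(":memory:")
    query = """
        WITH RECURSIVE codes(code) AS (
            SELECT unicode('A')
            UNION ALL
            SELECT code + 1 FROM codes WHERE code < unicode('Z')
        )
        SELECT char(code) FROM codes ORDER BY code
    """
    letters = [row[0] for row in conn.execute(query)]
    conn.close()
    return letters

print(" ".join(alphabet()))
```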

Count duplicate records by using linq

Col1 col2
x a
x a
x b
x b
y c
y c
y d
y d
z e
z e
z f
Now I want counts like the following:
x a 2
x b 2
y c 2
y d 2
Could anyone help me do this in LINQ?
table
.GroupBy(x => new { x.col1, x.col2 })
.Select(g => new { g.Key.col1, g.Key.col2, Count = g.Count() });
var Result =
from t in table
group t by new
{
t.col1,
t.col2,
} into gt
select new
{
col1 = gt.Key.col1,
col2 = gt.Key.col2,
count = gt.Count(),
};
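For comparison, the grouping logic can be sketched outside LINQ. The Python snippet below is only an illustration of what "group by (col1, col2) and count" computes on the sample data; it uses collections.Counter, and the names RECORDS and count_pairs are made up for this example.

```python
from collections import Counter

# Sample (col1, col2) rows from the question.
RECORDS = [
    ("x", "a"), ("x", "a"), ("x", "b"), ("x", "b"),
    ("y", "c"), ("y", "c"), ("y", "d"), ("y", "d"),
    ("z", "e"), ("z", "e"), ("z", "f"),
]

def count_pairs(records):
    """Count how many times each (col1, col2) pair occurs."""
    return Counter(records)

counts = count_pairs(RECORDS)
for (col1, col2), n in counts.items():
    print(col1, col2, n)
```

Note that grouping alone also reports the z rows (z e twice, z f once); filtering to pairs with a count above 1 would be an extra step.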

SQL: select only data from one column, for which one of the value from second column is x

My problem is easiest to explain with an example.
There are two columns:
X Y
A 2
A 1
A 3
B 3
C 2
A 1
D 2
B 1
B 3
C 1
A 1
D 3
D 1
Now I would like to select only the rows whose X value has Y = 2 in at least one row.
So my output should look like:
X Y
A 2
A 1
A 3
C 2
A 1
D 2
C 1
A 1
D 3
D 1
because no row with X = B has Y = 2 in the main table.
What query performs this operation? I tried something with CASE WHEN, but it didn't work for me.
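Try:
SELECT X, Y FROM Table WHERE X IN (SELECT X FROM Table WHERE Y = 2)
Or try a self-join (note that it repeats a row once per matching Y = 2 row, so it produces duplicates if an X value has more than one row with Y = 2):
SELECT t1.X, t1.Y FROM Table t1
INNER JOIN Table t2 ON t1.X = t2.X
WHERE t2.Y = 2
Try a subquery:
SELECT X, Y FROM table WHERE X IN (SELECT X FROM table WHERE Y = 2);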
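As a quick check of the subquery approach, here is a minimal sketch using Python's standard sqlite3 module and the sample data from the question. The table name data is hypothetical (TABLE is a reserved word), the query selects both X and Y to match the desired output, and ORDER BY rowid is added only to keep the rows in insertion order.

```python
import sqlite3

# (X, Y) rows from the question, in their original order.
ROWS = [
    ("A", 2), ("A", 1), ("A", 3), ("B", 3), ("C", 2), ("A", 1), ("D", 2),
    ("B", 1), ("B", 3), ("C", 1), ("A", 1), ("D", 3), ("D", 1),
]

def rows_for_x_having_y(rows, y_value):
    """Return all rows whose X value appears with Y = y_value somewhere."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (X TEXT, Y INTEGER)")
    conn.executemany("INSERT INTO data VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT X, Y FROM data "
        "WHERE X IN (SELECT X FROM data WHERE Y = ?) "
        "ORDER BY rowid",
        (y_value,),
    ).fetchall()
    conn.close()
    return result

for x, y in rows_for_x_having_y(ROWS, 2):
    print(x, y)
```

Every B row is dropped because B never appears with Y = 2, while all A, C and D rows are kept.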

SQL - get value from another table if column is null

I'm building matching rules for a data reconciliation system and need your advice on adjusting my SQL, as it currently doesn't return what I need.
There are 2 source tables:
Table X Table Y
--------------------- ----------------------
Exec_ID From To Exec_ID From To
1 A B 1 B C
2 A B 2 B C
3 A B 3 B C
4 A B
5 B C
Matching conditions are:
X.To = Y.From
X.Exec_ID = Y.Exec_ID
if there is A -> B and then B -> C, it should return A -> C in the end.
if there is only A -> B and no further B -> C, it should return A -> B.
So the output should be the following.
From To
---------
A C
A C
A C
A B
SQL I'm using is:
select X.From, Y.To
from x
left outer join y on
x.To = Y.From
and x.Exec_ID = y.Exec_ID
It returns the values like
A C
A C
A C
A Null
So the last record is incorrect, as it should be A B. Please help me adjust the query.
Check for NULL with COALESCE, which falls back to X.To when there is no matching row in Y:
select X.From, [To] = COALESCE(Y.To, X.To)
from x
left outer join y on
x.To = Y.From
and x.Exec_ID = y.Exec_ID
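The COALESCE fallback can be tried out with a minimal sketch using Python's standard sqlite3 module. The table contents are an assumption: the flattened table in the question is read here as X holding Exec_IDs 1 to 4 and Y holding 1, 2, 3 and 5, which is one reading that matches the expected four-row output. The column names are double-quoted because FROM and TO are SQL keywords, and ORDER BY is added only for a stable result order.

```python
import sqlite3

# Assumed contents of the two source tables (see the note above).
X_ROWS = [(1, "A", "B"), (2, "A", "B"), (3, "A", "B"), (4, "A", "B")]
Y_ROWS = [(1, "B", "C"), (2, "B", "C"), (3, "B", "C"), (5, "B", "C")]

def reconcile(x_rows, y_rows):
    """Chain X -> Y hops, keeping X's own endpoint when Y has no match."""
    conn = sqlite3.connect(":memory:")
    for name in ("x", "y"):
        conn.execute(
            f'CREATE TABLE {name} (Exec_ID INTEGER, "From" TEXT, "To" TEXT)'
        )
    conn.executemany("INSERT INTO x VALUES (?, ?, ?)", x_rows)
    conn.executemany("INSERT INTO y VALUES (?, ?, ?)", y_rows)
    result = conn.execute(
        'SELECT x."From", COALESCE(y."To", x."To") '
        "FROM x LEFT OUTER JOIN y "
        'ON x."To" = y."From" AND x.Exec_ID = y.Exec_ID '
        "ORDER BY x.Exec_ID"
    ).fetchall()
    conn.close()
    return result

for start, end in reconcile(X_ROWS, Y_ROWS):
    print(start, end)
```

Exec_ID 4 has no partner in Y, so the join yields NULL for Y.To and COALESCE substitutes X.To, giving A B instead of A NULL.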

What is the difference between WITH Query and SELECT Query?

I have written two queries, one using WITH and the other a plain SELECT, each self-joining the table below, but they return different results. Why does this happen?
The table name is test_cur:
ID_SOURCE_CUR ID_TARGET_CUR
------------- --------------
A B
B C
C D
D E
A Z
G A
K A
Q A
J J
K K
K L
L K
B A
Z A
So why do the two queries below return different results?
SELECT *
FROM test_cur tu, test_cur fu
WHERE tu.id_target_cur = 'A'
AND fu.id_source_cur = 'A'
AND tu.id_source_cur <> fu.id_target_cur;
returns 8 rows.
ID_SOURCE_CUR ID_TARGET_CUR ID_SOURCE_CUR_1 ID_TARGET_CUR_1
-------------- -------------- -------------- --------------
G A A B
K A A B
Q A A B
Z A A B
G A A Z
K A A Z
Q A A Z
B A A Z
And:
WITH qry1 AS
(SELECT *
FROM test_cur)
SELECT *
FROM qry1 tu, qry1 fu
WHERE tu.id_target_cur = 'A'
AND fu.id_target_cur = 'A'
AND tu.id_source_cur <> fu.id_target_cur;
returns 25 rows.
ID_SOURCE_CUR ID_TARGET_CUR ID_SOURCE_CUR_1 ID_TARGET_CUR_1
-------------- -------------- -------------- --------------
G A G A
G A K A
G A Q A
G A B A
G A Z A
K A G A
K A K A
K A Q A
K A B A
K A Z A
Q A G A
Q A K A
Q A Q A
Q A B A
Q A Z A
B A G A
B A K A
B A Q A
B A B A
B A Z A
Z A G A
Z A K A
Z A Q A
Z A B A
Z A Z A
Why?
Your second query is different: it has a different WHERE clause. The first WHERE is:
WHERE tu.id_target_cur = 'A'
AND fu.id_source_cur = 'A'
AND tu.id_source_cur <> fu.id_target_cur;
The second is:
WHERE tu.id_target_cur = 'A'
AND fu.id_target_cur = 'A' -- this line is different, it should be fu.id_source_cur = 'A'
AND tu.id_source_cur <> fu.id_target_cur;
Change that line and both queries return exactly the same results.
The WHERE clauses are different: fu.id_source_cur = 'A' vs. fu.id_target_cur = 'A'.
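The row counts can be reproduced with a minimal sketch using Python's standard sqlite3 module and the test_cur data from the question. The helper count_rows and its parameters are made up for this illustration; it builds the same self-join with or without the WITH clause and with either filter column on fu.

```python
import sqlite3

# (id_source_cur, id_target_cur) rows from the question.
ROWS = [
    ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "Z"),
    ("G", "A"), ("K", "A"), ("Q", "A"), ("J", "J"), ("K", "K"),
    ("K", "L"), ("L", "K"), ("B", "A"), ("Z", "A"),
]

def count_rows(use_cte, fu_column):
    """Count self-join rows, optionally via a CTE, filtering fu on fu_column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test_cur (id_source_cur TEXT, id_target_cur TEXT)")
    conn.executemany("INSERT INTO test_cur VALUES (?, ?)", ROWS)
    prefix = "WITH qry1 AS (SELECT * FROM test_cur) " if use_cte else ""
    source = "qry1" if use_cte else "test_cur"
    sql = (
        f"{prefix}SELECT COUNT(*) FROM {source} tu, {source} fu "
        f"WHERE tu.id_target_cur = 'A' AND fu.{fu_column} = 'A' "
        "AND tu.id_source_cur <> fu.id_target_cur"
    )
    (count,) = conn.execute(sql).fetchone()
    conn.close()
    return count

print(count_rows(False, "id_source_cur"))  # plain SELECT, original filter
print(count_rows(True, "id_target_cur"))   # WITH, filter as posted
print(count_rows(True, "id_source_cur"))   # WITH, filter corrected
```

With the same filter column, the WITH version and the plain SELECT return the same number of rows, so the difference comes from the WHERE clause and not from the WITH clause.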