Count duplicate records by using linq - sql

Col1 col2
x a
x a
x b
x b
y c
y c
y d
y d
z e
z e
z f
Now I want counts like the following:
x a 2
x b 2
y c 2
y d 2
How can I do this in LINQ? Could anyone please assist me?

table
.GroupBy(x => new { x.col1, x.col2 })   // group on the (col1, col2) pair
.Select(g => new { g.Key.col1, g.Key.col2, Count = g.Count() });   // rows per pair

var Result =
from t in table
group t by new
{
t.col1,
t.col2,
} into gt
select new
{
col1 = gt.Key.col1,
col2 = gt.Key.col2,
count = gt.Count(),
};
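
Since the question is also tagged sql, here is a minimal sketch of the same grouping run as SQL from Python's built-in sqlite3 module. The table name `t` and the helper `count_pairs` are hypothetical names chosen for illustration. It counts rows per (col1, col2) pair, just as the LINQ queries above do.

```python
import sqlite3

def count_pairs(rows):
    """Return (col1, col2, count) for each distinct pair, sorted by pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")  # 't' is a made-up name
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT col1, col2, COUNT(*) FROM t "
        "GROUP BY col1, col2 ORDER BY col1, col2"
    ).fetchall()
    conn.close()
    return result

# Sample rows from the question.
sample = [("x", "a"), ("x", "a"), ("x", "b"), ("x", "b"),
          ("y", "c"), ("y", "c"), ("y", "d"), ("y", "d"),
          ("z", "e"), ("z", "e"), ("z", "f")]

for row in count_pairs(sample):
    print(*row)
```

Note that this lists every pair, including those that occur only once; adding `HAVING COUNT(*) > 1` after the `GROUP BY` clause would keep only the duplicated pairs.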

Related

SQL: select only data from one column, for which one of the value from second column is x

My problem is easiest to explain with an example.
There are two columns:
X Y
A 2
A 1
A 3
B 3
C 2
A 1
D 2
B 1
B 3
C 1
A 1
D 3
D 1
Now I would like to select only the rows whose X value has Y = 2 in at least one row.
So my output should look like:
X Y
A 2
A 1
A 3
C 2
A 1
D 2
C 1
A 1
D 3
D 1
because no row with X = B has Y = 2 in the main table.
What is the query for this operation? I tried something with CASE WHEN, but it didn't work for me.
Try:
SELECT X, Y FROM Table WHERE X IN (SELECT X FROM Table WHERE Y = 2)
Or try:
SELECT t1.X, t1.Y FROM Table t1
INNER JOIN Table t2 ON t1.X = t2.X
WHERE t2.Y = 2
Try a subquery:
SELECT X, Y FROM table WHERE X IN (SELECT X FROM table WHERE Y = 2);
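
To check the subquery approach against the sample data, here is a minimal sketch using Python's built-in sqlite3 module. The table name `xy` and the helper `rows_with_y` are hypothetical (`table` itself is a reserved word, so a real name is needed). `ORDER BY rowid` is SQLite-specific and is only there to keep the original row order.

```python
import sqlite3

def rows_with_y(rows, y_value):
    """Return every (X, Y) row whose X has at least one row with Y = y_value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE xy (X TEXT, Y INTEGER)")  # 'xy' is a made-up name
    conn.executemany("INSERT INTO xy VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT X, Y FROM xy "
        "WHERE X IN (SELECT X FROM xy WHERE Y = ?) "
        "ORDER BY rowid",
        (y_value,),
    ).fetchall()
    conn.close()
    return result

# Sample rows from the question, in their original order.
sample = [("A", 2), ("A", 1), ("A", 3), ("B", 3), ("C", 2), ("A", 1), ("D", 2),
          ("B", 1), ("B", 3), ("C", 1), ("A", 1), ("D", 3), ("D", 1)]

for x, y in rows_with_y(sample, 2):
    print(x, y)
```

All rows for B are dropped because B never appears with Y = 2. The subquery form also avoids the duplicate rows that the self-join version would produce if some X had more than one row with Y = 2.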

How to Group By on multiple column in a pig script

What is the Pig equivalent of the SQL query below?
SELECT fld1, fld2, fld3, SUM(fld4)
FROM Table1
GROUP BY fld1, fld2, fld3;
For Table1:
A B C 2 X Y Z
A B C 3 X Y Z
A B D 2 X Y Z
A C D 2 X Y Z
A C D 2 X Y Z
A C D 2 X Y Z
OUTPUT:
A B C 5
A B D 2
A C D 6
See https://pig.apache.org/docs/r0.11.1/basic.html#GROUP, which includes
a multi-column group example.
For your use case, the code below should suffice:
A = load 'input.csv' using PigStorage(',') AS (fld1:chararray,fld2:chararray,fld3:chararray,fld4:long,fld5:chararray,fld6:chararray,fld7:chararray);
B = FOREACH(GROUP A BY (fld1,fld2,fld3)) GENERATE FLATTEN(group) AS (fld1,fld2,fld3), SUM(A.fld4) AS fld4_aggr;
DUMP B;
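
As a sanity check on the expected output, here is a minimal sketch that runs the question's SQL query against the sample rows using Python's built-in sqlite3 module. The column names fld5 to fld7 follow the Pig schema above; the helper name `sum_by_first_three` is hypothetical.

```python
import sqlite3

def sum_by_first_three(rows):
    """Group on (fld1, fld2, fld3) and sum fld4, as in the question's SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Table1 (fld1 TEXT, fld2 TEXT, fld3 TEXT, fld4 INTEGER, "
        "fld5 TEXT, fld6 TEXT, fld7 TEXT)"
    )
    conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT fld1, fld2, fld3, SUM(fld4) FROM Table1 "
        "GROUP BY fld1, fld2, fld3 ORDER BY fld1, fld2, fld3"
    ).fetchall()
    conn.close()
    return result

# Sample rows from the question.
sample = [
    ("A", "B", "C", 2, "X", "Y", "Z"),
    ("A", "B", "C", 3, "X", "Y", "Z"),
    ("A", "B", "D", 2, "X", "Y", "Z"),
    ("A", "C", "D", 2, "X", "Y", "Z"),
    ("A", "C", "D", 2, "X", "Y", "Z"),
    ("A", "C", "D", 2, "X", "Y", "Z"),
]

for row in sum_by_first_three(sample):
    print(*row)
```

The Pig script should DUMP the same three groups, although Pig prints them as tuples and does not guarantee their order.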

Find column from multiple column

I have a question about a database query. Please refer to the tables below.
Table 1:
ID Country
1 x
2 y
3 z
4 k
Table 2:
eng  fre  fre1  fre2
x    x
x1   k    y     t
x2   n    z
Output Table
id country
1 x
2 x1
3 x2
4 x1
How can I achieve this in Hive?
Thank you so much for your help.
You can join three times, but it may run slowly:
select a.id, coalesce(b.eng, c.eng, d.eng) as Country
from table_1 a
left join table_2 b on a.country=b.fre
left join table_2 c on a.country=c.fre1
left join table_2 d on a.country=d.fre2
;
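
The three-join query can be tried outside Hive. Below is a minimal sketch using Python's built-in sqlite3 module, whose LEFT JOIN and COALESCE behave the same way for this query. It assumes that the blank cells in Table 2 are NULLs in the rightmost columns (so x sits in fre, and n and z sit in fre and fre1), which the question does not state explicitly. The helper name `map_countries` is hypothetical.

```python
import sqlite3

def map_countries():
    """Look up each Table 1 country in fre, fre1, fre2 and return its eng value."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table_1 (id INTEGER, country TEXT);
        CREATE TABLE table_2 (eng TEXT, fre TEXT, fre1 TEXT, fre2 TEXT);
    """)
    conn.executemany("INSERT INTO table_1 VALUES (?, ?)",
                     [(1, "x"), (2, "y"), (3, "z"), (4, "k")])
    # Assumption: empty cells in the question's Table 2 are NULL (None here).
    conn.executemany("INSERT INTO table_2 VALUES (?, ?, ?, ?)",
                     [("x", "x", None, None),
                      ("x1", "k", "y", "t"),
                      ("x2", "n", "z", None)])
    result = conn.execute("""
        SELECT a.id, COALESCE(b.eng, c.eng, d.eng) AS country
        FROM table_1 a
        LEFT JOIN table_2 b ON a.country = b.fre
        LEFT JOIN table_2 c ON a.country = c.fre1
        LEFT JOIN table_2 d ON a.country = d.fre2
        ORDER BY a.id
    """).fetchall()
    conn.close()
    return result

for row in map_countries():
    print(*row)
```

Each left join probes one of the three columns, and COALESCE returns the first match that is not NULL. If a country value could appear in several rows of Table 2, the joins would multiply the output rows, so this sketch relies on each value appearing at most once.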

What is the difference between WITH Query and SELECT Query?

I have written two queries that self-join the table below, one using WITH and the other a plain SELECT, but they return different results. Why does this happen?
The table name is test_cur.
ID_SOURCE_CUR ID_TARGET_CUR
------------- --------------
A B
B C
C D
D E
A Z
G A
K A
Q A
J J
K K
K L
L K
B A
Z A
So why do the two queries below return different results?
SELECT *
FROM test_cur tu, test_cur fu
WHERE tu.id_target_cur = 'A'
AND fu.id_source_cur = 'A'
AND tu.id_source_cur <> fu.id_target_cur;
returns 8 rows.
ID_SOURCE_CUR ID_TARGET_CUR ID_SOURCE_CUR_1 ID_TARGET_CUR_1
-------------- -------------- -------------- --------------
G A A B
K A A B
Q A A B
Z A A B
G A A Z
K A A Z
Q A A Z
B A A Z
And:
WITH qry1 AS
(SELECT *
FROM test_cur)
SELECT *
FROM qry1 tu, qry1 fu
WHERE tu.id_target_cur = 'A'
AND fu.id_target_cur = 'A'
AND tu.id_source_cur <> fu.id_target_cur;
returns 25 rows.
ID_SOURCE_CUR ID_TARGET_CUR ID_SOURCE_CUR_1 ID_TARGET_CUR_1
-------------- -------------- -------------- --------------
G A G A
G A K A
G A Q A
G A B A
G A Z A
K A G A
K A K A
K A Q A
K A B A
K A Z A
Q A G A
Q A K A
Q A Q A
Q A B A
Q A Z A
B A G A
B A K A
B A Q A
B A B A
B A Z A
Z A G A
Z A K A
Z A Q A
Z A B A
Z A Z A
Why?
Your second query has a different WHERE clause. The first WHERE is:
WHERE tu.id_target_cur = 'A'
AND fu.id_source_cur = 'A'
AND tu.id_source_cur <> fu.id_target_cur;
The second is:
WHERE tu.id_target_cur = 'A'
AND fu.id_target_cur = 'A' -- this line is different, it should be fu.id_source_cur = 'A'
AND tu.id_source_cur <> fu.id_target_cur;
Change that line and both queries return exactly the same results.
The WHERE clauses are different: fu.id_source_cur = 'A' vs. fu.id_target_cur = 'A'.
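
To confirm that WITH itself is not the cause, here is a minimal sketch using Python's built-in sqlite3 module (the question's database is not stated, so SQLite is an assumption). It runs the plain self-join and the WITH version using the same WHERE clause, then the WITH version using the mistaken predicate. The helper `run` and the constant names are hypothetical.

```python
import sqlite3

# Rows of test_cur from the question.
PAIRS = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "Z"),
         ("G", "A"), ("K", "A"), ("Q", "A"), ("J", "J"), ("K", "K"),
         ("K", "L"), ("L", "K"), ("B", "A"), ("Z", "A")]

PLAIN = "SELECT * FROM test_cur tu, test_cur fu"
CTE = "WITH qry1 AS (SELECT * FROM test_cur) SELECT * FROM qry1 tu, qry1 fu"

# WHERE clause of the first query (filters fu on id_source_cur).
WHERE_SOURCE = (" WHERE tu.id_target_cur = 'A' AND fu.id_source_cur = 'A'"
                " AND tu.id_source_cur <> fu.id_target_cur")
# WHERE clause of the second query (filters fu on id_target_cur).
WHERE_TARGET = (" WHERE tu.id_target_cur = 'A' AND fu.id_target_cur = 'A'"
                " AND tu.id_source_cur <> fu.id_target_cur")

def run(sql):
    """Execute sql against a fresh in-memory copy of test_cur; return sorted rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test_cur (id_source_cur TEXT, id_target_cur TEXT)")
    conn.executemany("INSERT INTO test_cur VALUES (?, ?)", PAIRS)
    rows = sorted(conn.execute(sql).fetchall())
    conn.close()
    return rows

plain_rows = run(PLAIN + WHERE_SOURCE)
cte_rows = run(CTE + WHERE_SOURCE)
cte_target_rows = run(CTE + WHERE_TARGET)

print(len(plain_rows), len(cte_rows), len(cte_target_rows))
print(plain_rows == cte_rows)
```

With identical predicates the two forms return the same rows, so the difference in row counts comes only from the changed column in the WHERE clause.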

Max value in a many-to-many relationship

I'm using SQL Server 2008 and I have 3 tables, x, y and z. y exists to create a many-to-many relationship between x and z.
x     y     z
--    ---   ----
id    xid   id
      zid   sort
All of the above fields are int.
I want to find the best-performing method (excluding denormalising) of finding the z with the highest sort for any x, and return all fields from all three tables.
Sample data:
x: id
--
1
2
y: xid zid
--- ---
1 1
1 2
1 3
2 2
z: id sort
-- ----
1 5
2 10
3 25
Result set should be
xid zid
--- ---
1 3
2 2
Note that if more than one z exists with the same highest sort value, then I still only want one row per x.
Note also that in my real-world situation, there are other fields in all three tables which I will need in my result set.
One method is a subquery. However, this is only good for getting the ID of z. If you need more or all columns from both the x and z tables, this is not the best solution.
SELECT
    x.id,
    (
        SELECT TOP 1 z.id   -- z has no zid column; its key is id
        FROM y
        INNER JOIN z ON z.id = y.zid
        WHERE y.xid = x.id
        ORDER BY z.sort DESC
    ) AS zid
FROM x
This is how you can do it and return all the data from all the tables.
SELECT *
FROM x
INNER JOIN y
    ON y.xid = x.id
    AND y.zid = (
        -- id of the z with the highest sort for this x
        SELECT TOP 1 z2.id
        FROM y y2
        INNER JOIN z z2 ON z2.id = y2.zid
        WHERE y2.xid = x.id
        ORDER BY z2.sort DESC
    )
INNER JOIN z
    ON z.id = y.zid
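-- Caution: this picks the highest zid per xid, not the z with the highest sort;
-- the two coincide in the sample data only because sort increases with id.
select xid, max(zid) as zid from y
group by xid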
select xid, zid /* columns from x; and columns from y or z taken from q */
from (select y.xid, y.zid, /* columns from y or z */
row_number() over(partition by y.xid order by z.sort desc) r
from y
join z on z.id = y.zid
) q
join x on x.id = q.xid
where r = 1
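
The row_number() approach can be checked against the sample data. Below is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. It assumes the bundled SQLite is version 3.25 or later, which is when window functions were added. The helper name `top_z_per_x` is hypothetical.

```python
import sqlite3

def top_z_per_x():
    """Return (xid, zid, sort) for the z with the highest sort for each x."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE x (id INTEGER);
        CREATE TABLE y (xid INTEGER, zid INTEGER);
        CREATE TABLE z (id INTEGER, sort INTEGER);
        INSERT INTO x VALUES (1), (2);
        INSERT INTO y VALUES (1, 1), (1, 2), (1, 3), (2, 2);
        INSERT INTO z VALUES (1, 5), (2, 10), (3, 25);
    """)
    result = conn.execute("""
        SELECT q.xid, q.zid, q.sort
        FROM (SELECT y.xid, y.zid, z.sort,
                     ROW_NUMBER() OVER (PARTITION BY y.xid
                                        ORDER BY z.sort DESC) AS r
              FROM y
              JOIN z ON z.id = y.zid) q
        JOIN x ON x.id = q.xid
        WHERE q.r = 1
        ORDER BY q.xid
    """).fetchall()
    conn.close()
    return result

for row in top_z_per_x():
    print(*row)
```

Because ROW_NUMBER() assigns exactly one row the number 1 in each partition, ties on sort still yield a single row per x, although which of the tied rows is returned is arbitrary unless another column is added to the ORDER BY.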