I have two Datasets with four columns each.
My Datasets:
'left':
a | b | c | d
1 | 2 | 3 | 4
'right':
a | b | c | e
1 | 2 | 3 | 5
What I would like to achieve:
a | b | c | d | e
1 | 2 | 3 | 4 | 5
My code:
left.join(right, left.col("a").equalTo(right.col("a"))
    .and(left.col("b").equalTo(right.col("b")))
    .and(left.col("c").equalTo(right.col("c"))),
  "left");
I would like to add column 'e' from right to left, but what I get in return is:
a | b | c | d | a | b | c | e |
I get only a concatenation of the headers, with the join columns a, b, and c duplicated.
From a logical point of view my query seems correct, so I am apparently failing to tell Spark how to perform the join.
Any tips?
You can do the join with the following. When the key columns have the same names on both sides, you just need to pass the list of keys as a Seq, and Spark keeps a single copy of each key column:
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in spark-shell

val df1 = Seq(
  (1, 2, 3, 4)
).toDF("a", "b", "c", "d")
val df2 = Seq(
  (1, 2, 3, 5)
).toDF("a", "b", "c", "e")
val df3 = df1.join(df2, Seq("a", "b", "c"), "left")
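df3 should then contain a single row with all five columns (a sketch of the expected output; show()'s exact formatting may differ):
df3.show()
// +---+---+---+---+---+
// |  a|  b|  c|  d|  e|
// +---+---+---+---+---+
// |  1|  2|  3|  4|  5|
// +---+---+---+---+---+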
Related
I have a table with 2 columns and N rows, like below:
email | phone_num_list
----------------------
'a' | {"1", "2", "3"}
'a' | {"1", "4"}
'c' | {"5", "1", "6"}
'd' | {"3", "7", "1"}
where phone_num_list is of type array.
My requirement is to get the most used phone numbers and their scores, where score = (number of distinct emails associated with the phone_num) * (total frequency of the phone_num).
For the above example, the distinct emails associated with phone_num 1 are ["a", "c", "d"].
Hence the score of phone_num 1 is 3 (total distinct emails) * 4 (total frequency of 1) = 12.
The score calculation for each phone_num is shown below:
phone_num | distinct emails associated | freq of phone_num | final_score
----------+----------------------------+-------------------+-------------
1         | ["a", "c", "d"]            | 4                 | 4*3 = 12
2         | ["a"]                      | 1                 | 1*1 = 1
3         | ["a", "d"]                 | 2                 | 2*2 = 4
4         | ["a"]                      | 1                 | 1*1 = 1
5         | ["c"]                      | 1                 | 1*1 = 1
6         | ["c"]                      | 1                 | 1*1 = 1
7         | ["d"]                      | 1                 | 1*1 = 1
My desired output is:
phone | score
------+------
1     | 12
3     | 4
2     | 1
4     | 1
5     | 1
6     | 1
7     | 1
Please help me with the query in PostgreSQL.
Thanks!
Preparing the test case:
create temporary table t (email text, phone_num_list text[]);
insert into t(email, phone_num_list) values
('a', '{"1", "2", "3"}'),
('a', '{"1", "4"}'),
('c', '{"5", "1", "6"}'),
('d', '{"3", "7", "1"}');
'Normalize' the table in the nt CTE, then calculate the frequency times the number of distinct emails per phone number:
with nt as (select email, unnest(phone_num_list) as phone from t)
select phone, count(*) * count(distinct email) as score
from nt group by phone order by score desc;
phone|score|
-----+-----+
1    |   12|
3    |    4|
5    |    1|
6    |    1|
7    |    1|
4    |    1|
2    |    1|
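To see what the normalization step produces on its own, you can run the CTE body by itself; each array element becomes one (email, phone) row:
select email, unnest(phone_num_list) as phone from t;

email|phone|
-----+-----+
a    |1    |
a    |2    |
a    |3    |
a    |1    |
a    |4    |
c    |5    |
c    |1    |
c    |6    |
d    |3    |
d    |7    |
d    |1    |

Phone 1 appears 4 times across 3 distinct emails, hence its score of 4 * 3 = 12.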
You can use the "unnest" function:
select phone_number,
       count(phone_number) * count(distinct email) as score
from
(
  select email, unnest(phone_num_list) as phone_number
  from t
) z
group by 1
order by 2 desc;
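Against the same test table this yields the same scores (rows tied at score 1 come back in arbitrary order):

phone_number|score|
------------+-----+
1           |   12|
3           |    4|
2           |    1|
4           |    1|
5           |    1|
6           |    1|
7           |    1|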
How do I transform this table with arrays in num and letter columns:
id | num | letter
-----+-----------+-----------
111 | [1, 2] | [a, b]
111 | [3, 4] | [c, d]
222 | [5, 6, 7] | [e, f, g]
into this table
id | num | letter
-----+-----+--------
111 | 1 | a
111 | 2 | b
111 | 3 | c
111 | 4 | d
222 | 5 | e
222 | 6 | f
222 | 7 | g
Appendix:
Here is some SQL to play around with when trying the transformation:
with test as (
  select * from (
    values
      (111, array[1, 2], array['a', 'b']),
      (111, array[3, 4], array['c', 'd']),
      (222, array[5, 6, 7], array['e', 'f', 'g'])
  ) as t (id, num, letter)
)
select *
from test
PrestoDB seems to support unnest() with multiple arguments:
select t.id, u.n, u.l
from test t cross join
     unnest(num, letter) as u(n, l)
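Putting the appendix and the answer together, a complete runnable statement would be (a sketch in Presto syntax; unnest with several array arguments zips them positionally, which is exactly the row-wise pairing the desired output needs):
with test as (
  select * from (
    values
      (111, array[1, 2], array['a', 'b']),
      (111, array[3, 4], array['c', 'd']),
      (222, array[5, 6, 7], array['e', 'f', 'g'])
  ) as t (id, num, letter)
)
select t.id, u.n as num, u.l as letter
from test t
cross join unnest(num, letter) as u(n, l)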
Learning SQL here, and I ran into a challenge.
I have the following table:
tbl <- data.frame(
id_name = c("a", "a", "b", "c", "d", "f", "b", "c", "d", "f"),
value = c(1, -1, 1, 1, 1, 1, -1, -1, -1, -1),
score = c(1, 0, 1, 2, 3, 4, 3, 2, 1, 0),
date = as.Date(c("2001-1-1", "2002-1-1", "2003-1-1", "2005-1-1",
"2005-1-1", "2007-1-1", "2008-1-1", "2010-1-1",
"2011-1-1", "2012-1-1"), "%Y-%m-%d")
)
+---------+-------+-------+-----------+
| id_name | value | score | date |
+---------+-------+-------+-----------+
| a | 1 | 1 | 2001-1-1 |
| a | -1 | 0 | 2002-1-1 |
| b | 1 | 1 | 2003-1-1 |
| c | 1 | 2 | 2005-1-1 |
| d | 1 | 3 | 2005-1-1 |
| f | 1 | 4 | 2007-1-1 |
| b | -1 | 3 | 2008-1-1 |
| c | -1 | 2 | 2010-1-1 |
| d | -1 | 1 | 2011-1-1 |
| f | -1 | 0 | 2012-1-1 |
+---------+-------+-------+-----------+
My goal is this:
For each id_name, I'd like to get the date of the maximum score (the first such date in case of ties) from tbl, considering only the rows between the first and last dates on which that id_name appears (inclusive).
For example, id_name 'a' should return '2001-1-1', since the maximum score between its first and last dates is 1.
id_name 'b' should return '2007-1-1', since the maximum score between 2003-1-1 and 2008-1-1 is 4 (scored by 'f'):
+---------+----------+
| id_name | date |
+---------+----------+
| a | 2001-1-1 |
| b | 2007-1-1 |
+---------+----------+
This is what I have thus far,
sqldf("
SELECT
id_name,
date,
score
FROM
tbl As d
WHERE
score = (
SELECT MAX(score)
FROM tbl As b
WHERE
date >= (
SELECT MIN(date)
FROM tbl
WHERE id_name = b.id_name
) AND
date <= (
SELECT MAX(date)
FROM tbl
WHERE id_name = b.id_name
)
)
")
The problem is that it returns only the row with the global maximum score ('f', 2007-1-1), irrespective of the current row's id_name.
Thanks!
A correlated subquery in the WHERE clause will fit the bill here, as long as it is restricted to the date window spanned by each id_name; MIN(t.date) then picks the first date among ties:
SELECT g.id_name, MIN(t.date) AS date
FROM (SELECT id_name, MIN(date) AS d1, MAX(date) AS d2
      FROM tbl GROUP BY id_name) AS g
JOIN tbl AS t ON t.date BETWEEN g.d1 AND g.d2
WHERE t.score = (SELECT MAX(score) FROM tbl WHERE date BETWEEN g.d1 AND g.d2)
GROUP BY g.id_name
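Against the sample data this should return one row per id_name; c, d, and f all have their window's maximum score of 4 on 'f''s 2007-1-1 row, so they share its date:

+---------+----------+
| id_name | date     |
+---------+----------+
| a       | 2001-1-1 |
| b       | 2007-1-1 |
| c       | 2007-1-1 |
| d       | 2007-1-1 |
| f       | 2007-1-1 |
+---------+----------+

Note that sqldf (SQLite) stores R Dates as numbers, so the date column may come back numeric and need converting on the R side.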
I have a couple of tables like the ones below.
Table1:
A B C D  << columns
1 2 3 4  << single row
Table2:
W X Y Z  << columns
5 6 7 8  << single row
I want to combine these two tables in such a way that it gives me the following result:
Result:
P Q R S  << column headers
1 2 3 4  << row from table1
5 6 7 8  << row from table2
The expected result has column headers P, Q, R, and S, with the row from table1 followed by the row from table2.
How can I achieve this using SQL?
Use UNION ALL; it will not eliminate duplicates.
In set operations (UNION / INTERSECT / EXCEPT) the column aliases are taken from the first query. (Currently I'm aware of only one exception: Hive requires the aliases to be the same for all queries, which I consider a bug.)
select A as P, B as Q, C as R, D as S
from table1
union all
select W,X,Y,Z
from table2
+---+---+---+---+
| p | q | r | s |
+---+---+---+---+
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
+---+---+---+---+
If table2 had only 3 columns, you could drop the extra column from table1:
select B as Q, C as R, D as S
from table1
union all
select X,Y,Z
from table2
+---+---+---+
| q | r | s |
+---+---+---+
| 2 | 3 | 4 |
| 6 | 7 | 8 |
+---+---+---+
or pad table2 with NULL for the missing column:
select A as P, B as Q, C as R, D as S
from table1
union all
select null,X,Y,Z
from table2
+--------+---+---+---+
| p | q | r | s |
+--------+---+---+---+
| 1 | 2 | 3 | 4 |
| (null) | 6 | 7 | 8 |
+--------+---+---+---+
Updated to be more strict and more complete, thanks to @AntDC, @Matt, and @Dudu Markovitz.
Use UNION with aliases, like this:
SELECT A AS P, B AS Q, C AS R, D AS S
FROM table1
UNION
-- or UNION ALL if you want to keep duplicate rows
SELECT W, X, Y, Z
FROM table2
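Since the two sample rows differ, UNION and UNION ALL return the same result here (row order is not guaranteed without an ORDER BY):

+---+---+---+---+
| p | q | r | s |
+---+---+---+---+
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
+---+---+---+---+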
I'm using PostgreSQL 9.4, and I have a table with 13 million rows and data roughly as follows:
a | b | u | t
-----+---+----+----
foo | 1 | 1 | 10
foo | 1 | 2 | 11
foo | 1 | 2 | 11
foo | 2 | 4 | 1
foo | 3 | 5 | 2
bar | 1 | 6 | 2
bar | 2 | 7 | 2
bar | 2 | 8 | 3
bar | 3 | 9 | 4
bar | 4 | 10 | 5
bar | 5 | 11 | 6
baz | 1 | 12 | 1
baz | 1 | 13 | 2
baz | 1 | 13 | 2
baz | 1 | 13 | 3
There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.
I'm trying to build a query which will return the following results:
a | b | u | t | z
-----+---+----+----+---
foo | 1 | 1 | 10 | 3
foo | 1 | 2 | 11 | 3
foo | 2 | 4 | 1 | 3
foo | 3 | 5 | 2 | 3
bar | 1 | 6 | 2 | 5
bar | 2 | 7 | 2 | 5
bar | 2 | 8 | 3 | 5
bar | 3 | 9 | 4 | 5
bar | 4 | 10 | 5 | 5
bar | 5 | 11 | 6 | 5
In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied; z is the count of distinct values of b within each partition over a; and only rows with a z value greater than 2 are included.
I can get just the z filter working as follows:
SELECT a, COUNT(b) AS z from (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
HAVING COUNT(b) > 2;
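On the sample data this should return foo (3 distinct values of b) and bar (5), while baz (only 1) is filtered out:

a   | z
----+---
foo | 3
bar | 5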
However, I'm stumped on combining this with the rest of the data in the table.
What's the most efficient way to do this?
Your first step can be simpler already:
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2;
Working with md5(a) in place of a, since a can obviously be very long, and you already have an index on md5(a) etc.
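On the sample data this keeps the same two groups, now keyed by hash (hash values written symbolically here):

md5_a      | z
-----------+---
md5('foo') | 3
md5('bar') | 5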
Since your table is big, you need an efficient query. This should be among the fastest possible solutions - with adequate index support. Your index on (md5(a), b) is instrumental, but - assuming b, u, and t are small columns - an index on (md5(a), b, u, t) would be even better for the second step of the query (the join back to t).
Your desired end result:
SELECT DISTINCT ON (md5(t.a), b, u, t)
t.a, t.b, t.u, t.t, a.z
FROM (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) a
JOIN t ON md5(t.a) = md5_a
ORDER BY md5(t.a), t.b, t.u, t.t; -- optional; must start with the DISTINCT ON expressions
Or, probably faster yet:
SELECT a, b, u, t, z
FROM (
SELECT DISTINCT ON (1, 2, 3, 4)
md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
FROM t
) t
JOIN (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) z USING (md5_a)
ORDER BY 1, 2, 3, 4; -- optional
Detailed explanation for DISTINCT ON:
Select first row in each GROUP BY group?
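As a quick illustration of how DISTINCT ON picks one row per group (a sketch on a hypothetical orders table):
-- one row per customer: the most recent order
-- (the DISTINCT ON expressions must match the leftmost ORDER BY expressions)
SELECT DISTINCT ON (customer_id)
       customer_id, order_date, total
FROM   orders
ORDER  BY customer_id, order_date DESC;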