hive: join with regex

I'd like to implement a join with a regex/rlike condition, but Hive doesn't do inequality joins:
select a.col_1, b.col_2
from table1 a
left join table2 b
on a.col_1 rlike b.col_2
This actually works, but I want to match the full text of b.col_2 as a whole word within a.col_1. Is there a way to do this?
example dataset:
**table1**
apple iphone
apple iphone 6s
google nexus
samsung galaxy tab
**table2**
apple
google
nexus
**outcome**
+--------------------+--------+
| col1               | col2   |
+--------------------+--------+
| apple iphone       | apple  |
| apple iphone 6s    | apple  |
| google nexus       | google |
| samsung galaxy tab | null   |
+--------------------+--------+

select  col1
       ,col2
from   (select  t1.col1
               ,t2.col2
               ,count(col2)  over (partition by col1)       as count_col2
               ,row_number() over (partition by col1, col2) as rn
        from   (select *
                from   table1 t1
                       lateral view explode(split(col1,'\\s+')) e as token
               ) t1
               left join
               (select *
                from   table2 t2
                       lateral view explode(split(col2,'\\s+')) e as token
               ) t2
               on t2.token = t1.token
       ) t
where  (   count_col2 = 0
        or col1 rlike concat('\\b', col2, '\\b')
       )
and    rn = 1
;
+--------------------+--------+
| col1 | col2 |
+--------------------+--------+
| apple iphone | apple |
| apple iphone 6s | apple |
| google nexus | google |
| google nexus | nexus |
| samsung galaxy tab | (null) |
+--------------------+--------+
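The concat('\\b', col2, '\\b') filter relies on regex word boundaries. As a sketch, the same check can be reproduced outside Hive with Python's re module, whose \b behaves like the Java regex engine behind Hive's RLIKE; the rows below are the question's sample data:

```python
import re

table1 = ["apple iphone", "apple iphone 6s", "google nexus", "samsung galaxy tab"]
table2 = ["apple", "google", "nexus"]

def whole_word_match(text, word):
    # \b anchors match only at word boundaries, so "apple" matches
    # "apple iphone" but a partial token like "app" would not
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

# Emulate the left join: emit every matching (col1, col2) pair,
# or (col1, None) when nothing in table2 matches.
outcome = []
for col1 in table1:
    matches = [col2 for col2 in table2 if whole_word_match(col1, col2)]
    if matches:
        outcome.extend((col1, m) for m in matches)
    else:
        outcome.append((col1, None))

for row in outcome:
    print(row)
```

Like the query above, this returns both google and nexus for "google nexus"; the rn = 1 filter only removes duplicate pairs produced by multiple matching tokens, not multiple col2 values per col1.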

Related

SQL Retrieve latest value relative to a subset

I'm trying to run a SQL query that pulls in the latest result relative to a subset.
GIVEN Device 1 is linked to Device 3, I would like the query to show Device 1's color as Device 3's because Device 1 is linked to Device 3 and Device 3's color has been changed most recently.
Current results:
Device | Color
-------+---------
1 | red
2 | blue
3 | green
What I would like:
Device | Color
-------+---------
1 | GREEN
2 | blue
3 | GREEN
DeviceTable
account_id
----------
1
2
3
BeaconsTable
account_id | Color | Time
-----------+--------+--------
1 | red | 6:00
2 | blue | 7:00
3 | red | 8:00
3 | green | 10:00
LinkTable
account_id | Link
------------+--------
1 | 3
2 | -
3 | -
Here's what I am doing right now; again, this gives the latest color for each device, but not for the linked device.
SELECT d.* , b.*
FROM Device d
JOIN Beacons b ON d.account_id = b.account_id
JOIN (
SELECT b.account_id,
MAX(b.timestamp) as max_date
FROM Beacons b
GROUP BY b.account_id) x ON x.account_id = d.account_id
AND x.max_date = b.timestamp
If I am following this logic correctly, you want to join to the links and use the link there, under some circumstances:
SELECT COALESCE(l.link, d.account_id) as device, b.color
FROM Device d LEFT JOIN
Links l
ON d.account_id = l.account_id LEFT JOIN
(SELECT b.*,
ROW_NUMBER() OVER (PARTITION BY b.account_id ORDER BY b.time DESC) as seqnum
FROM beacons b
) b
ON b.account_id = COALESCE(l.link, d.account_id);
Join the tables and then LEFT join 1 more copy of Beacons to get the potential color of the linked device.
Then use window function FIRST_VALUE() to get that color:
select distinct d.account_id,
first_value(coalesce(b2.Color, b.Color)) over (partition by d.account_id order by coalesce(b2.Time, b.Time) desc) Color
from Device d
inner join Beacons b on b.account_id = d.account_id
inner join Link l on l.account_id = d.account_id
left join Beacons b2 on b2.account_id = l.Link and b2.Time > b.Time
Note that the column Time must be stored in a zero-padded hh:mm format so that it compares correctly.
See the demo.
Results:
account_id | Color
-----------+------
1          | green
2          | blue
3          | green
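As a sanity check, the FIRST_VALUE() answer can be run on the question's sample data using Python's bundled sqlite3 (window functions need SQLite ≥ 3.25); the '-' links are stored as NULL so the COALESCE calls behave as intended:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Device  (account_id INTEGER);
CREATE TABLE Beacons (account_id INTEGER, Color TEXT, Time TEXT);
CREATE TABLE Link    (account_id INTEGER, Link INTEGER);

INSERT INTO Device  VALUES (1), (2), (3);
-- Time stored as zero-padded hh:mm text so string comparison orders correctly
INSERT INTO Beacons VALUES (1,'red','06:00'), (2,'blue','07:00'),
                           (3,'red','08:00'), (3,'green','10:00');
INSERT INTO Link    VALUES (1,3), (2,NULL), (3,NULL);
""")

rows = con.execute("""
SELECT DISTINCT d.account_id,
       FIRST_VALUE(COALESCE(b2.Color, b.Color)) OVER (
           PARTITION BY d.account_id
           ORDER BY COALESCE(b2.Time, b.Time) DESC) AS Color
FROM Device d
JOIN Beacons b ON b.account_id = d.account_id
JOIN Link    l ON l.account_id = d.account_id
LEFT JOIN Beacons b2 ON b2.account_id = l.Link AND b2.Time > b.Time
ORDER BY d.account_id
""").fetchall()

print(rows)  # → [(1, 'green'), (2, 'blue'), (3, 'green')]
```

Device 1 picks up device 3's most recent color through the extra LEFT JOIN, while devices 2 and 3 fall back to their own latest beacon.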
This should work:
SELECT
account_id,
color
FROM Device AS d
JOIN (
SELECT
linked_id AS account_id,
color,
ROW_NUMBER() OVER (PARTITION BY linked_id ORDER BY Time DESC) AS row_number
FROM Beacons AS b
JOIN (
SELECT
account_id AS linked_id,
Link AS account_id
FROM Linked
) USING (account_id)
) USING (account_id)
WHERE row_number = 1
;
Here's a Demo

SQL Loop and Join

I have a table:
Vers | Rev
3 | A
7 | B
13 | C
And a second table:
Info | Version
aab | 1
adr | 2
bhj | 3
bgt | 4
nnh | 4
ggt | 7
I need to have a table:
Info | Version | Rev
aab | 1 | A
adr | 2 | A
bhj | 3 | A
bgt | 4 | B
nnh | 4 | B
ggt | 7 | B
How do I achieve the final table?
Rev A is for Versions 1-3, Rev B is versions 4-7, Rev C is versions 8-13.
If I were doing this in Excel with VBA, I would add a new column holding each range's start: a 1 for the first row, then the previous Vers value plus one (3 + 1 = 4) for the next row, and so on.
Then I would use logic like: if a version is >= the new column value and <= Vers, write that Rev.
I don't know how to do this in SQL, and I need to!
Try this; you can do it by joining the tables:
select
t2.Info Info
,t2.Version Version
,t1.Rev Rev
from table1 t1,table2 t2
where t2.Version=t1.Vers;
Use outer apply:
select t2.*, t1.rev
from table2 t2 outer apply
(select top (1) t1.*
from table1 t1
where t2.version <= t1.vers
order by t1.vers asc
) t1;
This gets the "next" version in table1 relative to each version in table2.
You can also do this with a subquery:
SELECT *
, (SELECT TOP 1 b.rev
FROM Table1 b
WHERE a.version <= b.vers
ORDER BY b.vers)
FROM Table2 a
Or a third version:
declare @t1 table(V int, R char(1))
insert @t1 values (3,'A'),(7,'B'),(13,'C')
declare @t2 table(I char(3), V int)
insert @t2 values ('aab',1),('adr',2),('bhj',3),('bgt',4),('nnh',4),('ggt',7)
select t2.*, t1.R
from @t2 t2
join @t1 t1 on t1.V >= t2.V and not exists(select * from @t1 t3 where t3.V >= t2.V and t3.V < t1.V)
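All three answers implement the same "smallest Vers at or above the Version" lookup. As a sketch, the correlated-subquery form can be checked with Python's sqlite3, swapping TOP (1) for SQLite's LIMIT 1:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table1 (Vers INTEGER, Rev TEXT);
INSERT INTO table1 VALUES (3,'A'), (7,'B'), (13,'C');

CREATE TABLE table2 (Info TEXT, Version INTEGER);
INSERT INTO table2 VALUES ('aab',1),('adr',2),('bhj',3),
                          ('bgt',4),('nnh',4),('ggt',7);
""")

rows = con.execute("""
SELECT t2.Info, t2.Version,
       (SELECT t1.Rev
        FROM table1 t1
        WHERE t2.Version <= t1.Vers   -- smallest Vers covering this Version
        ORDER BY t1.Vers
        LIMIT 1) AS Rev
FROM table2 t2
ORDER BY t2.rowid
""").fetchall()

print(rows)
# → [('aab', 1, 'A'), ('adr', 2, 'A'), ('bhj', 3, 'A'),
#    ('bgt', 4, 'B'), ('nnh', 4, 'B'), ('ggt', 7, 'B')]
```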

MSSQL: update column from joined table

I have 2 tables ex. TABLE1 and TABLE2
TABLE1 TABLE2
ID | SIZE | VALUE ID | SIZE | SCORE
1 | LOW | 1.0 1 | MID | 3232
2 | MID | 3.0 2 | MID | 2321
3 | HIGH | 5.0 3 | HIGH | 3232
What I want is to update TABLE2.SCORE so its values come from the TABLE1.VALUE column, joining on SIZE.
OUTPUT:
ID | SIZE | SCORE
1 | MID | 3.0
2 | MID | 3.0
3 | HIGH | 5.0
I tried:
Update Table2 set SCORE = (select top (1) VALUE from TABLE1 join TABLE2 on table1.size = table2.size)
However, this does not work; I get this result:
OUTPUT:
ID | SIZE | SCORE
1 | MID | 3.0
2 | MID | 3.0
3 | HIGH | 3.0 <---- wrong
Try this
DECLARE @TABLE1 AS TABLE(ID INT, SIZE VARCHAR(10), VALUE DECIMAL(2,1))
INSERT INTO @TABLE1
SELECT 1, 'LOW', 1.0 UNION ALL
SELECT 2, 'MID', 3.0 UNION ALL
SELECT 3, 'HIGH', 5.0

DECLARE @TABLE2 AS TABLE(ID INT, SIZE VARCHAR(10), SCORE INT)
INSERT INTO @TABLE2
SELECT 1, 'MID', 3232 UNION ALL
SELECT 2, 'MID', 2321 UNION ALL
SELECT 3, 'HIGH', 3232

SELECT * FROM @TABLE2

UPDATE t2
SET SCORE = t1.VALUE
FROM @TABLE2 t2 INNER JOIN @TABLE1 t1 ON t1.SIZE = t2.SIZE

SELECT ID, SIZE, CAST(SCORE AS DECIMAL(2,1)) AS SCORE
FROM @TABLE2
Demo result : http://rextester.com/VFF59681
You can use a JOIN in the UPDATE:
update t2
set t2.score = t1.value
from table2 t2 join
     table1 t1
     on t2.size = t1.size;
You can also follow your pattern by using a correlated subquery:
update table2
set score = (select t1.value from table1 t1 where t1.size = table2.size);
There is no need for another JOIN in the subquery.
update a
set a.score = b.value
from table2 a join table1 b on a.size = b.size
You don't need to do the JOIN in subquery you can directly express it as :
update table2
set score = (select top (1) t1.value from table1 t1 where t1.size = table2.size);
You can achieve it like this:
update table2
set table2.SCORE = table1.VALUE
from table2
join table1
on table2.SIZE = table1.SIZE
However, to avoid problems, you will need to make sure table1.SIZE is unique.

Identify duplicates based on multiple columns

I want to identify duplicates in a database based on multiple columns from various tables. In the example below, rows 1 & 5 and 2 & 4 are duplicates, since all four columns have the same values. How do I identify such records with SQL? I have used GROUP BY ... HAVING COUNT(*) > 1 when identifying duplicates based on a single column, but I am unsure how to do it across multiple columns. When I GROUP BY all 4 columns with HAVING COUNT(*) > 1, rows 3 and 6 also show up, and they are technically not duplicates per my requirement.
T1
ID | Col1 | Col2
---| --- | ---
1 | A | US
2 | B | FR
3 | C | AU
4 | B | FR
5 | A | US
6 | D | UK
T2
ID | Col1
---| ---
1 | Apple
1 | Kiwi
2 | Pear
3 | Banana
3 | Banana
4 | Pear
5 | Apple
T3
ID | Col1
---| ---
1 | Spinach
1 | Beets
2 | Celery
3 | Radish
4 | Celery
5 | Spinach
6 | Celery
6 | Celery
My expected result would be:
1 A US Apple Spinach
5 A US Apple Spinach
2 B FR Pear Celery
4 B FR Pear Celery
For your sample data, you can achieve this by inner joining all three tables and filtering with a GROUP BY tA.Col1 HAVING COUNT(tA.Col1) > 1 subquery in the WHERE clause, as below.
SELECT t1.ID,
t1.Col1,
t1.Col2,
t2.Col1,
t3.Col1
FROM table1 t1
JOIN table2 t2 ON t1.ID = t2.ID
JOIN table3 t3 ON t1.ID = t3.ID
WHERE t1.Col1 IN
( SELECT tA.Col1
FROM table1 tA
GROUP BY tA.Col1
HAVING count(tA.Col1)>1)
ORDER BY t1.ID;
Result
ID Col1 Col2 Col1 Col1
-----------------------------------
1 A US Apple Spinach
2 B FR Pear Celery
4 B FR Pear Celery
5 A US Apple Spinach
You can check the demo here
Hope this will help.
The problem is that your result set needs to include the ID column, which is unique, so a straightforward GROUP BY ... HAVING won't cut it. This would work:
with cte as
( select t1.id
, t1.col1 as t1_col1
, t1.col2 as t1_col2
, t2.col1 as t2_col1
, t3.col1 as t3_col1
from t1
join t2 on t1.id = t2.id
join t3 on t1.id = t3.id
)
select cte.*
from cte
where (t1_col1, t1_col2, t2_col1, t3_col1) in
( select t1_col1, t1_col2, t2_col1, t3_col1
from cte
group by t1_col1, t1_col2, t2_col1, t3_col1 having count(*) > 1)
/
The use of the subquery factoring (WITH) syntax is optional, but I find it useful to signal that the subquery is used more than once in the query.
"I have encountered another scenario in the data, some of the IDs have same values in T2 and T3 and they are showing up as dups."
The duplicated IDs in the child tables cause Cartesian products in the joined subquery, which causes false positives in the main result set. Ideally you should be able to handle this by introducing additional filters on those tables to remove the unwanted rows. However, if the data quality is so poor that there are no valid rules you will have to fall back on distinct:
with cte as (
select t1.id
, t1.col1 as t1_col1
, t1.col2 as t1_col2
, t2.col1 as t2_col1
, t3.col1 as t3_col1
from t1
join ( select distinct id, col1 from t2) t2 on t1.id = t2.id
join ( select distinct id, col1 from t3) t3 on t1.id = t3.id
) ...
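Combining the CTE, the row-value IN filter, and the deduplicated child joins, the whole approach can be checked against the question's sample data with Python's sqlite3 (row values need SQLite ≥ 3.15):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1 (id INTEGER, col1 TEXT, col2 TEXT);
INSERT INTO t1 VALUES (1,'A','US'),(2,'B','FR'),(3,'C','AU'),
                      (4,'B','FR'),(5,'A','US'),(6,'D','UK');
CREATE TABLE t2 (id INTEGER, col1 TEXT);
INSERT INTO t2 VALUES (1,'Apple'),(1,'Kiwi'),(2,'Pear'),
                      (3,'Banana'),(3,'Banana'),(4,'Pear'),(5,'Apple');
CREATE TABLE t3 (id INTEGER, col1 TEXT);
INSERT INTO t3 VALUES (1,'Spinach'),(1,'Beets'),(2,'Celery'),(3,'Radish'),
                      (4,'Celery'),(5,'Spinach'),(6,'Celery'),(6,'Celery');
""")

rows = con.execute("""
WITH cte AS (
  SELECT t1.id, t1.col1 AS t1_col1, t1.col2 AS t1_col2,
         t2.col1 AS t2_col1, t3.col1 AS t3_col1
  FROM t1
  -- DISTINCT avoids the false positives caused by duplicated child rows
  JOIN (SELECT DISTINCT id, col1 FROM t2) t2 ON t1.id = t2.id
  JOIN (SELECT DISTINCT id, col1 FROM t3) t3 ON t1.id = t3.id
)
SELECT * FROM cte
WHERE (t1_col1, t1_col2, t2_col1, t3_col1) IN (
  SELECT t1_col1, t1_col2, t2_col1, t3_col1
  FROM cte
  GROUP BY t1_col1, t1_col2, t2_col1, t3_col1
  HAVING COUNT(*) > 1)
ORDER BY t1_col1, id
""").fetchall()

print(rows)
# → [(1, 'A', 'US', 'Apple', 'Spinach'), (5, 'A', 'US', 'Apple', 'Spinach'),
#    (2, 'B', 'FR', 'Pear', 'Celery'), (4, 'B', 'FR', 'Pear', 'Celery')]
```

Without the inner DISTINCTs, row 3 would surface as a duplicate because the repeated Banana row in t2 doubles its joined combination.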
You can add all the columns for which you want to find duplicates to the GROUP BY clause and then write the count condition in the HAVING clause:
select t1.id,t1.col1,t2.col2,t2.col3,t3.col4 from t1 join t2 on t1.id=t2.id join t3 on t3.id=t1.id where (t1.col1,t2.col2,t2.col3,t3.col4) in (
select t1.col1,t2.col2,t2.col3,t3.col4
from t1 join t2 on t1.id=t2.id join t3 on t3.id=t1.id
group by t1.col1,t2.col2,t2.col3,t3.col4
having count(*) >1 )

Merging two data sets on closest date efficiently in PostgreSQL

I'm trying to merge two tables with different time resolutions on their closest date.
The tables are like this:
Table1:
id | date | device | value1
----------------------------------
1 | 10:22 | 13 | 0.53
2 | 10:24 | 13 | 0.67
3 | 10:25 | 14 | 0.83
4 | 10:25 | 13 | 0.32
Table2:
id | date | device | value2
----------------------------------
22 | 10:18 | 13 | 0.77
23 | 10:21 | 14 | 0.53
24 | 10:23 | 13 | 0.67
25 | 10:28 | 14 | 0.83
26 | 10:31 | 13 | 0.23
I want to merge these tables along the first one: append value2 to Table1, taking, for each device, the latest value2 that appears before the Table1 row's date.
Result:
id | date | device | value1 | value2
-------------------------------------------
1 | 10:22 | 13 | 0.53 | 0.77
2 | 10:24 | 13 | 0.67 | 0.67
3 | 10:25 | 14 | 0.83 | 0.53
4 | 10:25 | 13 | 0.32 | 0.67
I have some (20-30) devices, thousands of rows in Table2 (=m) and millions of them in Table1 (=n).
I could sort both tables by date (O(n log n)), write them to a text file, and iterate over Table1 merge-style, pulling data from Table2 until it is newer (I'd have to manage ~20-30 pointers, one per device, to the latest data, but no more), then upload the result back to the database. The complexity would then be O(n log n) for sorting and O(n+m) for iterating over the tables.
But it would be much better to do it entirely in the database. The best query I could achieve was O(n^2), though:
SELECT DISTINCT ON (Table1.id)
Table1.id, Table1.date, Table1.device, Table1.value1, Table2.value2
FROM Table1, Table2
WHERE Table1.date > Table2.date and Table1.device = Table2.device
ORDER BY Table1.id, Table1.date-Table2.date;
It's really slow for the amount of data I need to process; are there better ways to do this, or should I just process the downloaded data outside the database?
Your query can be rewritten as:
SELECT DISTINCT ON (t1.id)
t1.id, t1.date, t1.device, t1.value1, t2.value2
FROM table1 t1
JOIN table2 t2 USING (device)
WHERE t1.date > t2.date
ORDER BY t1.id, t2.date DESC;
No need to calculate a date difference for every combination of rows (which is expensive and not sargable), just pick the row with the greatest t2.date from each set. Index support is advisable.
Details for DISTINCT ON:
Select first row in each GROUP BY group?
That's probably not fast enough, yet. Given your data distribution you would need a loose index scan, which can be emulated with correlated subqueries (like Gordon's query) or a more modern and versatile JOIN LATERAL:
SELECT t1.id, t1.date, t1.device, t1.value1, t2.value2
FROM table1 t1
LEFT JOIN LATERAL (
SELECT value2
FROM table2
WHERE device = t1.device
AND date < t1.date
ORDER BY date DESC
LIMIT 1
) t2 ON TRUE;
The LEFT JOIN avoids losing rows when no match is found in t2. Details:
Optimize GROUP BY query to retrieve latest row per user
But that's still not very fast, since you have "thousands of rows in Table2 and millions of them in Table1".
Two ideas, probably faster, but also more complex:
1. UNION ALL plus window functions
Combine Table1 and Table2 in a UNION ALL query and run a window function over the derived table. This is enhanced by the "moving aggregate support" in Postgres 9.4 or later.
SELECT id, date, device, value1, value2
FROM (
SELECT id, date, device, value1
, min(value2) OVER (PARTITION BY device, grp) AS value2
FROM (
SELECT *
, count(value2) OVER (PARTITION BY device ORDER BY date) AS grp
FROM (
SELECT id, date, device, value1, NULL::numeric AS value2
FROM table1
UNION ALL
SELECT id, date, device, NULL::numeric AS value1, value2
FROM table2
) s1
) s2
) s3
WHERE value1 IS NOT NULL
ORDER BY date, id;
You'll have to test if it can compete. Sufficient work_mem allows in-memory sorting.
db<>fiddle here for all three queries
Old sqlfiddle
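As a sketch, the UNION ALL idea runs almost unchanged on any engine with window functions; here it is on the question's sample data via Python's sqlite3 (SQLite ≥ 3.25), with the NULL::numeric casts shortened to plain NULL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table1 (id INTEGER, date TEXT, device INTEGER, value1 REAL);
INSERT INTO table1 VALUES (1,'10:22',13,0.53),(2,'10:24',13,0.67),
                          (3,'10:25',14,0.83),(4,'10:25',13,0.32);
CREATE TABLE table2 (id INTEGER, date TEXT, device INTEGER, value2 REAL);
INSERT INTO table2 VALUES (22,'10:18',13,0.77),(23,'10:21',14,0.53),
                          (24,'10:23',13,0.67),(25,'10:28',14,0.83),
                          (26,'10:31',13,0.23);
""")

rows = con.execute("""
SELECT id, date, device, value1, value2
FROM (
  SELECT id, date, device, value1
       , MIN(value2) OVER (PARTITION BY device, grp) AS value2
  FROM (
    SELECT *
         -- grp increments whenever a table2 row (non-NULL value2) is seen,
         -- grouping each table1 row with the latest preceding value2
         , COUNT(value2) OVER (PARTITION BY device ORDER BY date) AS grp
    FROM (
      SELECT id, date, device, value1, NULL AS value2 FROM table1
      UNION ALL
      SELECT id, date, device, NULL AS value1, value2 FROM table2
    ) s1
  ) s2
) s3
WHERE value1 IS NOT NULL
ORDER BY date, id
""").fetchall()

print(rows)
# → [(1, '10:22', 13, 0.53, 0.77), (2, '10:24', 13, 0.67, 0.67),
#    (3, '10:25', 14, 0.83, 0.53), (4, '10:25', 13, 0.32, 0.67)]
```

Each (device, grp) partition contains exactly one non-NULL value2, so MIN() simply carries it onto the Table1 rows that follow it.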
2. PL/pgSQL function
Open a cursor for each device in Table2, loop over Table1, and for each row read the value from the matching device's cursor, advancing it until cursor.date > t1.date and keeping value2 from the previous row. Similar to the winning implementation here:
Window Functions or Common Table Expressions: count previous rows within range
Probably fastest, but more code to write.
Because Table2 is so much smaller, it might be more efficient to use a correlated subquery:
select t1.*,
(select t2.value2
from table2 t2
where t2.device = t1.device and t2.date <= t1.date
order by t2.date desc
limit 1
) as value2
from table1 t1;
Also create an index on table2(device, date, value2) for performance.
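A sketch of this correlated subquery on the question's sample data, via Python's sqlite3, including the recommended index:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table1 (id INTEGER, date TEXT, device INTEGER, value1 REAL);
INSERT INTO table1 VALUES (1,'10:22',13,0.53),(2,'10:24',13,0.67),
                          (3,'10:25',14,0.83),(4,'10:25',13,0.32);
CREATE TABLE table2 (id INTEGER, date TEXT, device INTEGER, value2 REAL);
INSERT INTO table2 VALUES (22,'10:18',13,0.77),(23,'10:21',14,0.53),
                          (24,'10:23',13,0.67),(25,'10:28',14,0.83),
                          (26,'10:31',13,0.23);
-- the covering index the answer recommends; each subquery probe stops at one row
CREATE INDEX t2_dev_date ON table2(device, date, value2);
""")

rows = con.execute("""
SELECT t1.*,
       (SELECT t2.value2
        FROM table2 t2
        WHERE t2.device = t1.device AND t2.date <= t1.date
        ORDER BY t2.date DESC
        LIMIT 1) AS value2
FROM table1 t1
ORDER BY t1.id
""").fetchall()

print(rows)
# → [(1, '10:22', 13, 0.53, 0.77), (2, '10:24', 13, 0.67, 0.67),
#    (3, '10:25', 14, 0.83, 0.53), (4, '10:25', 13, 0.32, 0.67)]
```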